2020-06-30 12:01:15

by Qais Yousef

Subject: [PATCH v6 0/2] sched: Optionally skip uclamp logic in fast path

This series attempts to address the report that uclamp logic could be expensive
sometimes and shows a regression in netperf UDP_STREAM under certain
conditions.

The first patch is a fix for how struct uclamp_rq is initialized which is
required by the 2nd patch which contains the real 'fix'.

Worth noting that the root cause of the overhead is believed to be system
specific or related to potential code/data layout issues, leading to worse
I/D cache performance.

Different systems exhibited different behaviors, and the regression
disappeared in certain kernel versions while attempting to reproduce it.

More info can be found here:

https://lore.kernel.org/lkml/20200616110824.dgkkbyapn3io6wik@e107158-lin/

Having the static key seemed the best way to ensure the effect of uclamp is
minimized for kernels that compile it in but don't have a userspace that uses
it. This allows distros to ship uclamp-capable kernels by default without
having to compromise on performance for the systems that could be affected.

Changes in v6:
* s/uclamp_is_enabled/uclamp_is_used/ + add comment
* Improve the bailout condition for the case where we could end up with
unbalanced call of uclamp_rq_dec_id()
* Clarify some comments.

Changes in v5:
* Fix a race that could happen when enqueue/dequeue of tasks A and B
happens out of order, and sched_uclamp_used is enabled in between.
* Add more comments explaining the race and the behavior of
uclamp_rq_util_with() which is now protected with a static key to be
a NOP. When no uclamp aggregation at rq level is done, this function
can't do much.

Changes in v4:
* Fix broken boosting of RT tasks when static key is disabled.

Changes in v3:
* Avoid double negatives and rename the static key to uclamp_used
* Unconditionally enable the static key through any of the paths where
the user can modify the default uclamp value.
* Use C99 named struct initializer for struct uclamp_rq which is easier
to read than the memset().

Changes in v2:
* Add more info in the commit message about the result of perf diff to
demonstrate that the activate/deactivate_task pressure is reduced in
the fast path.

* Fix sparse warning reported by the test robot.

* Add an extra commit about using static_branch_likely() instead of
static_branch_unlikely().

Thanks

--
Qais Yousef

Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Ben Segall <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Patrick Bellasi <[email protected]>
Cc: Chris Redpath <[email protected]>
Cc: Lukasz Luba <[email protected]>
Cc: [email protected]


Qais Yousef (2):
sched/uclamp: Fix initialization of struct uclamp_rq
sched/uclamp: Protect uclamp fast path code with static key

kernel/sched/core.c | 95 ++++++++++++++++++++++++++++++--
kernel/sched/cpufreq_schedutil.c | 2 +-
kernel/sched/sched.h | 47 +++++++++++++++-
3 files changed, 135 insertions(+), 9 deletions(-)

--
2.17.1


2020-06-30 12:02:21

by Qais Yousef

Subject: [PATCH v6 1/2] sched/uclamp: Fix initialization of struct uclamp_rq

struct uclamp_rq was zeroed out entirely on the assumption that the first
call to uclamp_rq_inc() would initialize it correctly in accordance with
the default settings.

But when the next patch introduces a static key to skip
uclamp_rq_{inc,dec}() until userspace opts in to use uclamp, schedutil
will fail to perform any frequency changes because
rq->uclamp[UCLAMP_MAX].value is zeroed at init and stays as such, which
means all rqs are capped to 0 by default.

Fix it by making sure we do proper initialization at init without
relying on uclamp_rq_inc() doing it later.

Fixes: 69842cba9ace ("sched/uclamp: Add CPU's clamp buckets refcounting")
Reviewed-by: Valentin Schneider <[email protected]>
Signed-off-by: Qais Yousef <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Ben Segall <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Patrick Bellasi <[email protected]>
Cc: Chris Redpath <[email protected]>
Cc: Lukasz Luba <[email protected]>
Cc: [email protected]
---
kernel/sched/core.c | 21 ++++++++++++++++-----
1 file changed, 16 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 8fe2ac910bed..235b2cae00a0 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1248,6 +1248,20 @@ static void uclamp_fork(struct task_struct *p)
}
}

+static void __init init_uclamp_rq(struct rq *rq)
+{
+ enum uclamp_id clamp_id;
+ struct uclamp_rq *uc_rq = rq->uclamp;
+
+ for_each_clamp_id(clamp_id) {
+ uc_rq[clamp_id] = (struct uclamp_rq) {
+ .value = uclamp_none(clamp_id)
+ };
+ }
+
+ rq->uclamp_flags = 0;
+}
+
static void __init init_uclamp(void)
{
struct uclamp_se uc_max = {};
@@ -1256,11 +1270,8 @@ static void __init init_uclamp(void)

mutex_init(&uclamp_mutex);

- for_each_possible_cpu(cpu) {
- memset(&cpu_rq(cpu)->uclamp, 0,
- sizeof(struct uclamp_rq)*UCLAMP_CNT);
- cpu_rq(cpu)->uclamp_flags = 0;
- }
+ for_each_possible_cpu(cpu)
+ init_uclamp_rq(cpu_rq(cpu));

for_each_clamp_id(clamp_id) {
uclamp_se_set(&init_task.uclamp_req[clamp_id],
--
2.17.1

2020-07-01 16:34:41

by Lukasz Luba

Subject: Re: [PATCH v6 0/2] sched: Optionally skip uclamp logic in fast path

Hi Qais,

On 6/30/20 12:21 PM, Qais Yousef wrote:
> This series attempts to address the report that uclamp logic could be expensive
> sometimes and shows a regression in netperf UDP_STREAM under certain
> conditions.
>
> The first patch is a fix for how struct uclamp_rq is initialized which is
> required by the 2nd patch which contains the real 'fix'.
>
> Worth noting that the root cause of the overhead is believed to be system
> specific or related to potential code/data layout issues, leading to worse
> I/D cache performance.
>
> Different systems exhibited different behaviors, and the regression
> disappeared in certain kernel versions while attempting to reproduce it.
>
> More info can be found here:
>
> https://lore.kernel.org/lkml/20200616110824.dgkkbyapn3io6wik@e107158-lin/
>
> Having the static key seemed the best way to ensure the effect of uclamp is
> minimized for kernels that compile it in but don't have a userspace that uses
> it. This allows distros to ship uclamp-capable kernels by default without
> having to compromise on performance for the systems that could be affected.
>
> Changes in v6:
> * s/uclamp_is_enabled/uclamp_is_used/ + add comment
> * Improve the bailout condition for the case where we could end up with
> unbalanced call of uclamp_rq_dec_id()
> * Clarify some comments.
>

I've tried this v6 series with mmtests netperf-udp (30x each UDP
size) - the results are good.

                      v5.7-rc7-base-noucl   v5.7-rc7-ucl-tsk-nofix   v5.7-rc7-ucl-tsk-grp-fix_v6
Hmean send-64          62.15 (  0.00%)        59.65 *  -4.02%*         65.87 *   5.99%*
Hmean send-128        122.88 (  0.00%)       119.37 *  -2.85%*        131.76 *   7.23%*
Hmean send-256        244.85 (  0.00%)       234.26 *  -4.32%*        259.87 *   6.14%*
Hmean send-1024       919.24 (  0.00%)       880.67 *  -4.20%*        975.48 *   6.12%*
Hmean send-2048      1689.45 (  0.00%)      1647.54 *  -2.48%*       1797.23 *   6.38%*
Hmean send-3312      2542.36 (  0.00%)      2485.23 *  -2.25%*       2665.69 *   4.85%*
Hmean send-4096      2935.69 (  0.00%)      2861.09 *  -2.54%*       3087.79 *   5.18%*
Hmean send-8192      4800.35 (  0.00%)      4680.09 *  -2.51%*       4966.50 *   3.46%*
Hmean send-16384     7473.66 (  0.00%)      7349.60 *  -1.66%*       7598.49 *   1.67%*
Hmean recv-64          62.15 (  0.00%)        59.65 *  -4.03%*         65.86 *   5.97%*
Hmean recv-128        122.88 (  0.00%)       119.37 *  -2.85%*        131.76 *   7.23%*
Hmean recv-256        244.84 (  0.00%)       234.26 *  -4.32%*        259.80 *   6.11%*
Hmean recv-1024       919.24 (  0.00%)       880.67 *  -4.20%*        975.42 *   6.11%*
Hmean recv-2048      1689.44 (  0.00%)      1647.54 *  -2.48%*       1797.18 *   6.38%*
Hmean recv-3312      2542.36 (  0.00%)      2485.23 *  -2.25%*       2665.53 *   4.84%*
Hmean recv-4096      2935.69 (  0.00%)      2861.09 *  -2.54%*       3087.79 *   5.18%*
Hmean recv-8192      4800.35 (  0.00%)      4678.15 *  -2.55%*       4966.48 *   3.46%*
Hmean recv-16384     7473.63 (  0.00%)      7349.52 *  -1.66%*       7597.90 *   1.66%*

You can add:

Tested-by: Lukasz Luba <[email protected]>

Regards,
Lukasz

2020-07-03 12:11:58

by Vincent Guittot

Subject: Re: [PATCH v6 0/2] sched: Optionally skip uclamp logic in fast path

On Tue, 30 Jun 2020 at 13:22, Qais Yousef <[email protected]> wrote:
>
> This series attempts to address the report that uclamp logic could be expensive
> sometimes and shows a regression in netperf UDP_STREAM under certain
> conditions.
>
> The first patch is a fix for how struct uclamp_rq is initialized which is
> required by the 2nd patch which contains the real 'fix'.
>
> Worth noting that the root cause of the overhead is believed to be system
> specific or related to potential code/data layout issues, leading to worse
> I/D cache performance.
>
> Different systems exhibited different behaviors, and the regression
> disappeared in certain kernel versions while attempting to reproduce it.
>
> More info can be found here:
>
> https://lore.kernel.org/lkml/20200616110824.dgkkbyapn3io6wik@e107158-lin/
>
> Having the static key seemed the best way to ensure the effect of uclamp is
> minimized for kernels that compile it in but don't have a userspace that uses
> it. This allows distros to ship uclamp-capable kernels by default without
> having to compromise on performance for the systems that could be affected.
>
> Changes in v6:
> * s/uclamp_is_enabled/uclamp_is_used/ + add comment
> * Improve the bailout condition for the case where we could end up with
> unbalanced call of uclamp_rq_dec_id()
> * Clarify some comments.
>
> Changes in v5:
> * Fix a race that could happen when enqueue/dequeue of tasks A and B
> happens out of order, and sched_uclamp_used is enabled in between.
> * Add more comments explaining the race and the behavior of
> uclamp_rq_util_with() which is now protected with a static key to be
> a NOP. When no uclamp aggregation at rq level is done, this function
> can't do much.
>
> Changes in v4:
> * Fix broken boosting of RT tasks when static key is disabled.
>
> Changes in v3:
> * Avoid double negatives and rename the static key to uclamp_used
> * Unconditionally enable the static key through any of the paths where
> the user can modify the default uclamp value.
> * Use C99 named struct initializer for struct uclamp_rq which is easier
> to read than the memset().
>
> Changes in v2:
> * Add more info in the commit message about the result of perf diff to
> demonstrate that the activate/deactivate_task pressure is reduced in
> the fast path.
>
> * Fix sparse warning reported by the test robot.
>
> * Add an extra commit about using static_branch_likely() instead of
> static_branch_unlikely().
>
> Thanks
>
> --
> Qais Yousef
>
> Cc: Juri Lelli <[email protected]>
> Cc: Vincent Guittot <[email protected]>
> Cc: Dietmar Eggemann <[email protected]>
> Cc: Steven Rostedt <[email protected]>
> Cc: Ben Segall <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Patrick Bellasi <[email protected]>
> Cc: Chris Redpath <[email protected]>
> Cc: Lukasz Luba <[email protected]>
> Cc: [email protected]
>
>
> Qais Yousef (2):
> sched/uclamp: Fix initialization of struct uclamp_rq
> sched/uclamp: Protect uclamp fast path code with static key

I have run the perf bench sched pipe test that I have already run
previously with this v6, and the results are similar to my previous
tests: the impact is -1.61%, similar to v2, which is better than the
original -3.66% without your patch.

>
> kernel/sched/core.c | 95 ++++++++++++++++++++++++++++++--
> kernel/sched/cpufreq_schedutil.c | 2 +-
> kernel/sched/sched.h | 47 +++++++++++++++-
> 3 files changed, 135 insertions(+), 9 deletions(-)
>
> --
> 2.17.1
>

2020-07-06 10:43:58

by Qais Yousef

Subject: Re: [PATCH v6 0/2] sched: Optionally skip uclamp logic in fast path

On 07/03/20 14:09, Vincent Guittot wrote:
> I have run the perf bench sched pipe test that I have already run
> previously with this v6, and the results are similar to my previous
> tests: the impact is -1.61%, similar to v2, which is better than the
> original -3.66% without your patch.

Thanks Vincent.

Could you do a `perf record` capture and share the resulting
perf.data along with vmlinux (with debug symbols)?

Having a before/after capture would be even better.

Not sure if we can do much about this -1.61% in your case, but it'd be good
to understand why, if possible. perf bench sched pipe is very sensitive to
the tiniest of changes, which could be due to binary-to-binary differences.

Thanks

--
Qais Yousef

2020-07-07 12:32:33

by Vincent Guittot

Subject: Re: [PATCH v6 0/2] sched: Optionally skip uclamp logic in fast path

On Mon, 6 Jul 2020 at 12:41, Qais Yousef <[email protected]> wrote:
>
> On 07/03/20 14:09, Vincent Guittot wrote:
> > I have run the perf bench sched pipe test that I have already run
> > previously with this v6, and the results are similar to my previous
> > tests: the impact is -1.61%, similar to v2, which is better than the
> > original -3.66% without your patch.
>
> Thanks Vincent.
>
> Could you do a `perf record` capture and share the resulting
> perf.data along with vmlinux (with debug symbols)?

Will try to make it by end of the week

>
> Having a before/after capture would be even better.
>
> Not sure if we can do much about this -1.61% in your case, but it'd be good
> to understand why, if possible. perf bench sched pipe is very sensitive to
> the tiniest of changes, which could be due to binary-to-binary differences.
>
> Thanks
>
> --
> Qais Yousef

2020-07-07 13:12:17

by Qais Yousef

Subject: Re: [PATCH v6 0/2] sched: Optionally skip uclamp logic in fast path

On 07/07/20 14:29, Vincent Guittot wrote:
> On Mon, 6 Jul 2020 at 12:41, Qais Yousef <[email protected]> wrote:
> >
> > On 07/03/20 14:09, Vincent Guittot wrote:
> > > I have run the perf bench sched pipe test that I have already run
> > > previously with this v6, and the results are similar to my previous
> > > tests: the impact is -1.61%, similar to v2, which is better than the
> > > original -3.66% without your patch.
> >
> > Thanks Vincent.
> >
> > Could you do a `perf record` capture and share the resulting
> > perf.data along with vmlinux (with debug symbols)?
>
> Will try to make it by end of the week

Thanks! If you want a place to drop them let me know.

Cheers

--
Qais Yousef

>
> >
> > Having a before/after capture would be even better.
> >
> > Not sure if we can do much about this -1.61% in your case, but it'd be good
> > to understand why, if possible. perf bench sched pipe is very sensitive to
> > the tiniest of changes, which could be due to binary-to-binary differences.
> >
> > Thanks
> >
> > --
> > Qais Yousef

Subject: [tip: sched/core] sched/uclamp: Fix initialization of struct uclamp_rq

The following commit has been merged into the sched/core branch of tip:

Commit-ID: d81ae8aac85ca2e307d273f6dc7863a721bf054e
Gitweb: https://git.kernel.org/tip/d81ae8aac85ca2e307d273f6dc7863a721bf054e
Author: Qais Yousef <[email protected]>
AuthorDate: Tue, 30 Jun 2020 12:21:22 +01:00
Committer: Peter Zijlstra <[email protected]>
CommitterDate: Wed, 08 Jul 2020 11:39:01 +02:00

sched/uclamp: Fix initialization of struct uclamp_rq

struct uclamp_rq was zeroed out entirely on the assumption that the first
call to uclamp_rq_inc() would initialize it correctly in accordance with
the default settings.

But when the next patch introduces a static key to skip
uclamp_rq_{inc,dec}() until userspace opts in to use uclamp, schedutil
will fail to perform any frequency changes because
rq->uclamp[UCLAMP_MAX].value is zeroed at init and stays as such, which
means all rqs are capped to 0 by default.

Fix it by making sure we do proper initialization at init without
relying on uclamp_rq_inc() doing it later.

Fixes: 69842cba9ace ("sched/uclamp: Add CPU's clamp buckets refcounting")
Signed-off-by: Qais Yousef <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Valentin Schneider <[email protected]>
Tested-by: Lukasz Luba <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
---
kernel/sched/core.c | 21 ++++++++++++++++-----
1 file changed, 16 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 15c980a..9605db7 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1239,6 +1239,20 @@ static void uclamp_fork(struct task_struct *p)
}
}

+static void __init init_uclamp_rq(struct rq *rq)
+{
+ enum uclamp_id clamp_id;
+ struct uclamp_rq *uc_rq = rq->uclamp;
+
+ for_each_clamp_id(clamp_id) {
+ uc_rq[clamp_id] = (struct uclamp_rq) {
+ .value = uclamp_none(clamp_id)
+ };
+ }
+
+ rq->uclamp_flags = 0;
+}
+
static void __init init_uclamp(void)
{
struct uclamp_se uc_max = {};
@@ -1247,11 +1261,8 @@ static void __init init_uclamp(void)

mutex_init(&uclamp_mutex);

- for_each_possible_cpu(cpu) {
- memset(&cpu_rq(cpu)->uclamp, 0,
- sizeof(struct uclamp_rq)*UCLAMP_CNT);
- cpu_rq(cpu)->uclamp_flags = 0;
- }
+ for_each_possible_cpu(cpu)
+ init_uclamp_rq(cpu_rq(cpu));

for_each_clamp_id(clamp_id) {
uclamp_se_set(&init_task.uclamp_req[clamp_id],