2023-09-15 13:55:28

by Peter Zijlstra

Subject: [PATCH 2/2] sched/eevdf: Use sched_attr::sched_runtime to set request/slice suggestion

Allow applications to directly set a suggested request/slice length using
sched_attr::sched_runtime.

The implementation clamps the value to: 0.1[ms] <= slice <= 100[ms]
that is, 1/10th of the tick period at HZ=1000 and 10 times the tick period at HZ=100.

Applications should strive to use their periodic runtime at a high
confidence level (95%+) as the target slice. Using a smaller slice
will introduce undue preemptions, while using a larger value will
increase latency.
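
For illustration only (not part of this patch): a minimal userspace sketch
of how an application could suggest a slice for itself via sched_setattr().
The 2ms value and the helper name are made up; error handling is elided.

#define _GNU_SOURCE
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>

/* glibc has no wrapper; the usual trick is to carry the UAPI layout here. */
struct sched_attr {
	uint32_t size;
	uint32_t sched_policy;
	uint64_t sched_flags;
	int32_t  sched_nice;
	uint32_t sched_priority;
	uint64_t sched_runtime;		/* request/slice suggestion, in ns */
	uint64_t sched_deadline;
	uint64_t sched_period;
};

/* Suggest a slice of @ns nanoseconds for the calling thread. */
static int set_slice_ns(uint64_t ns)
{
	struct sched_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.sched_policy = 0;		/* SCHED_OTHER */
	attr.sched_runtime = ns;	/* kernel clamps to [0.1ms, 100ms] */

	return syscall(SYS_sched_setattr, 0, &attr, 0);
}

/* e.g. set_slice_ns(2 * 1000 * 1000); -- ~95th percentile of the per-period work */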

For all the following examples assume a scheduling quantum of 8, and for
consistency all examples have W=4:

{A,B,C,D}(w=1,r=8):

ABCD...
+---+---+---+---

t=0, V=1.5 t=1, V=3.5
A |------< A |------<
B |------< B |------<
C |------< C |------<
D |------< D |------<
---+*------+-------+--- ---+--*----+-------+---

t=2, V=5.5 t=3, V=7.5
A |------< A |------<
B |------< B |------<
C |------< C |------<
D |------< D |------<
---+----*--+-------+--- ---+------*+-------+---

Note: 4 identical tasks in FIFO order

~~~

{A,B}(w=1,r=16) C(w=2,r=16)

AACCBBCC...
+---+---+---+---

t=0, V=1.25 t=2, V=5.25
A |--------------< A |--------------<
B |--------------< B |--------------<
C |------< C |------<
---+*------+-------+--- ---+----*--+-------+---

t=4, V=8.25 t=6, V=12.25
A |--------------< A |--------------<
B |--------------< B |--------------<
C |------< C |------<
---+-------*-------+--- ---+-------+---*---+---

Note: 1 heavy task -- because q=8, double r such that the deadline of the w=2
task doesn't go below q.

Note: observe the full schedule becomes: W*max(r_i/w_i) = 4*2q = 8q in length.

Note: the period of the heavy task is half the full period at:
W*(r_i/w_i) = 4*(2q/2) = 4q

~~~

{A,C,D}(w=1,r=16) B(w=1,r=8):

BAACCBDD...
+---+---+---+---

t=0, V=1.5 t=1, V=3.5
A |--------------< A |---------------<
B |------< B |------<
C |--------------< C |--------------<
D |--------------< D |--------------<
---+*------+-------+--- ---+--*----+-------+---

t=3, V=7.5 t=5, V=11.5
A |---------------< A |---------------<
B |------< B |------<
C |--------------< C |--------------<
D |--------------< D |--------------<
---+------*+-------+--- ---+-------+--*----+---

t=6, V=13.5
A |---------------<
B |------<
C |--------------<
D |--------------<
---+-------+----*--+---

Note: 1 short task -- again double r so that the deadline of the short task
won't be below q. Made B short because it's not the leftmost task, but it is
eligible with the 0,1,2,3 spread.

Note: like with the heavy task, the period of the short task observes:
W*(r_i/w_i) = 4*(1q/1) = 4q

~~~

A(w=1,r=16) B(w=1,r=8) C(w=2,r=16)

BCCAABCC...
+---+---+---+---

t=0, V=1.25 t=1, V=3.25
A |--------------< A |--------------<
B |------< B |------<
C |------< C |------<
---+*------+-------+--- ---+--*----+-------+---

t=3, V=7.25 t=5, V=11.25
A |--------------< A |--------------<
B |------< B |------<
C |------< C |------<
---+------*+-------+--- ---+-------+--*----+---

t=6, V=13.25
A |--------------<
B |------<
C |------<
---+-------+----*--+---

Note: 1 heavy and 1 short task -- combine them all.

Note: both the short and heavy task end up with a period of 4q

~~~

A(w=1,r=16) B(w=2,r=16) C(w=1,r=8)

BBCAABBC...
+---+---+---+---

t=0, V=1 t=2, V=5
A |--------------< A |--------------<
B |------< B |------<
C |------< C |------<
---+*------+-------+--- ---+----*--+-------+---

t=3, V=7 t=5, V=11
A |--------------< A |--------------<
B |------< B |------<
C |------< C |------<
---+------*+-------+--- ---+-------+--*----+---

t=7, V=15
A |--------------<
B |------<
C |------<
---+-------+------*+---

Note: as before but permuted

~~~

From all this it can be deduced that, for the steady state:

- the total period (P) of a schedule is: W*max(r_i/w_i)
- the average period of a task is: W*(r_i/w_i)
- each task obtains the fair share: w_i/W of each full period P
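
As a quick sanity check of these relations (illustration only, not part of
the patch), plugging in the earlier {A,B}(w=1,r=16) C(w=2,r=16) example:

#include <stdio.h>

int main(void)
{
	const int q = 8;			/* scheduling quantum, as above */
	const int w[] = { 1, 1, 2 };		/* weights of A, B, C */
	const int r[] = { 2*q, 2*q, 2*q };	/* requests of A, B, C */
	int n = 3, W = 0, max_rw = 0;

	for (int i = 0; i < n; i++) {
		W += w[i];
		if (r[i] / w[i] > max_rw)
			max_rw = r[i] / w[i];
	}

	/* total period: W*max(r_i/w_i) = 4*2q = 8q = 64 */
	printf("P = %d\n", W * max_rw);

	/* per-task period W*(r_i/w_i): 8q for A and B, 4q for C */
	for (int i = 0; i < n; i++)
		printf("task %d: period = %d, share = %d/%d of P\n",
		       i, W * (r[i] / w[i]), w[i], W);
	return 0;
}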

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
---
include/linux/sched.h | 3 +++
kernel/sched/core.c | 33 ++++++++++++++++++++++++++-------
kernel/sched/fair.c | 6 ++++--
3 files changed, 33 insertions(+), 9 deletions(-)

Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -555,6 +555,9 @@ struct sched_entity {
struct list_head group_node;
unsigned int on_rq;

+ unsigned int custom_slice : 1;
+ /* 31 bits hole */
+
u64 exec_start;
u64 sum_exec_runtime;
u64 prev_sum_exec_runtime;
Index: linux-2.6/kernel/sched/core.c
===================================================================
--- linux-2.6.orig/kernel/sched/core.c
+++ linux-2.6/kernel/sched/core.c
@@ -4501,7 +4501,6 @@ static void __sched_fork(unsigned long c
p->se.nr_migrations = 0;
p->se.vruntime = 0;
p->se.vlag = 0;
- p->se.slice = sysctl_sched_base_slice;
INIT_LIST_HEAD(&p->se.group_node);

#ifdef CONFIG_FAIR_GROUP_SCHED
@@ -4755,6 +4754,8 @@ int sched_fork(unsigned long clone_flags

p->prio = p->normal_prio = p->static_prio;
set_load_weight(p, false);
+ p->se.custom_slice = 0;
+ p->se.slice = sysctl_sched_base_slice;

/*
* We don't need the reset flag anymore after the fork. It has
@@ -7552,10 +7553,20 @@ static void __setscheduler_params(struct

p->policy = policy;

- if (dl_policy(policy))
+ if (dl_policy(policy)) {
__setparam_dl(p, attr);
- else if (fair_policy(policy))
+ } else if (fair_policy(policy)) {
p->static_prio = NICE_TO_PRIO(attr->sched_nice);
+ if (attr->sched_runtime) {
+ p->se.custom_slice = 1;
+ p->se.slice = clamp_t(u64, attr->sched_runtime,
+ NSEC_PER_MSEC/10, /* HZ=1000 * 10 */
+ NSEC_PER_MSEC*100); /* HZ=100 / 10 */
+ } else {
+ p->se.custom_slice = 0;
+ p->se.slice = sysctl_sched_base_slice;
+ }
+ }

/*
* __sched_setscheduler() ensures attr->sched_priority == 0 when
@@ -7740,7 +7751,9 @@ recheck:
* but store a possible modification of reset_on_fork.
*/
if (unlikely(policy == p->policy)) {
- if (fair_policy(policy) && attr->sched_nice != task_nice(p))
+ if (fair_policy(policy) &&
+ (attr->sched_nice != task_nice(p) ||
+ (attr->sched_runtime && attr->sched_runtime != p->se.slice)))
goto change;
if (rt_policy(policy) && attr->sched_priority != p->rt_priority)
goto change;
@@ -7886,6 +7899,9 @@ static int _sched_setscheduler(struct ta
.sched_nice = PRIO_TO_NICE(p->static_prio),
};

+ if (p->se.custom_slice)
+ attr.sched_runtime = p->se.slice;
+
/* Fixup the legacy SCHED_RESET_ON_FORK hack. */
if ((policy != SETPARAM_POLICY) && (policy & SCHED_RESET_ON_FORK)) {
attr.sched_flags |= SCHED_FLAG_RESET_ON_FORK;
@@ -8062,12 +8078,14 @@ err_size:

static void get_params(struct task_struct *p, struct sched_attr *attr)
{
- if (task_has_dl_policy(p))
+ if (task_has_dl_policy(p)) {
__getparam_dl(p, attr);
- else if (task_has_rt_policy(p))
+ } else if (task_has_rt_policy(p)) {
attr->sched_priority = p->rt_priority;
- else
+ } else {
attr->sched_nice = task_nice(p);
+ attr->sched_runtime = p->se.slice;
+ }
}

/**
@@ -10086,6 +10104,7 @@ void __init sched_init(void)
}

set_load_weight(&init_task, false);
+ init_task.se.slice = sysctl_sched_base_slice;

/*
* The boot idle thread does lazy MMU switching as well:
Index: linux-2.6/kernel/sched/fair.c
===================================================================
--- linux-2.6.orig/kernel/sched/fair.c
+++ linux-2.6/kernel/sched/fair.c
@@ -974,7 +974,8 @@ static void update_deadline(struct cfs_r
* nice) while the request time r_i is determined by
* sysctl_sched_base_slice.
*/
- se->slice = sysctl_sched_base_slice;
+ if (!se->custom_slice)
+ se->slice = sysctl_sched_base_slice;

/*
* EEVDF: vd_i = ve_i + r_i / w_i
@@ -4922,7 +4923,8 @@ place_entity(struct cfs_rq *cfs_rq, stru
u64 vslice, vruntime = avg_vruntime(cfs_rq);
s64 lag = 0;

- se->slice = sysctl_sched_base_slice;
+ if (!se->custom_slice)
+ se->slice = sysctl_sched_base_slice;
vslice = calc_delta_fair(se->slice, se);

/*



2023-09-19 16:44:52

by Mike Galbraith

Subject: Re: [PATCH 2/2] sched/eevdf: Use sched_attr::sched_runtime to set request/slice suggestion

On Fri, 2023-09-15 at 14:43 +0200, [email protected] wrote:
> Allow applications to directly set a suggested request/slice length using
> sched_attr::sched_runtime.

I met an oddity while fiddling to see what a custom slice would do for
cyclictest, it seeming to be a reasonable target. $subject seems to be
working fine per a live peek with crash, the caveat being it's not alone in
otherwise virgin source...

homer:..kernel/linux-tip # make -j8 vmlinux && killall cyclictest

Elsewhere, post build/kill...

instance-a, stock
homer:/root # cyclictest -Smq
# /dev/cpu_dma_latency set to 0us
T: 0 (17763) P: 0 I:1000 C: 297681 Min: 2 Act: 51 Avg: 66 Max: 10482
T: 1 (17764) P: 0 I:1500 C: 198639 Min: 2 Act: 3159 Avg: 70 Max: 9054
T: 2 (17765) P: 0 I:2000 C: 149075 Min: 2 Act: 52 Avg: 78 Max: 8190
T: 3 (17766) P: 0 I:2500 C: 119867 Min: 2 Act: 55 Avg: 77 Max: 8328
T: 4 (17767) P: 0 I:3000 C: 99888 Min: 2 Act: 51 Avg: 90 Max: 8483
T: 5 (17768) P: 0 I:3500 C: 85748 Min: 3 Act: 53 Avg: 76 Max: 8148
T: 6 (17769) P: 0 I:4000 C: 75153 Min: 3 Act: 53 Avg: 92 Max: 7510
T: 7 (17770) P: 0 I:4500 C: 66807 Min: 3 Act: 55 Avg: 81 Max: 8709

instance-b, launched w. custom slice, and verified via live peek with crash.
homer:/root # schedtool -v -s 100000 -e cyclictest -Smq
PID 17753: PRIO 0, POLICY N: SCHED_NORMAL , NICE 0, EEVDF_NICE 0, EEVDF_SLICE 100000, AFFINITY 0xff
# /dev/cpu_dma_latency set to 0us
T: 0 (17754) P: 0 I:1000 C: 297014 Min: 1 Act: 51 Avg: 79 Max: 9584
T: 1 (17755) P: 0 I:1500 C: 198401 Min: 1 Act: 3845 Avg: 118 Max: 9995
T: 2 (17756) P: 0 I:2000 C: 149137 Min: 1 Act: 103 Avg: 125 Max: 8863
T: 3 (17757) P: 0 I:2500 C: 119519 Min: 1 Act: 52 Avg: 218 Max: 9704
T: 4 (17758) P: 0 I:3000 C: 99760 Min: 1 Act: 51 Avg: 134 Max: 11108
T: 5 (17759) P: 0 I:3500 C: 85731 Min: 1 Act: 53 Avg: 234 Max: 9649
T: 6 (17760) P: 0 I:4000 C: 75321 Min: 2 Act: 53 Avg: 139 Max: 7351
T: 7 (17761) P: 0 I:4500 C: 66929 Min: 3 Act: 51 Avg: 191 Max: 6469
^^^ hmm
Those Avg: numbers follow the custom slice.

homer:/root # schedtool -v -s 500000 -e cyclictest -Smq
PID 29755: PRIO 0, POLICY N: SCHED_NORMAL , NICE 0, EEVDF_NICE 0, EEVDF_SLICE 500000, AFFINITY 0xff
# /dev/cpu_dma_latency set to 0us
T: 0 (29756) P: 0 I:1000 C: 352348 Min: 1 Act: 51 Avg: 67 Max: 10259
T: 1 (29757) P: 0 I:1500 C: 229618 Min: 1 Act: 59 Avg: 121 Max: 8439
T: 2 (29758) P: 0 I:2000 C: 176031 Min: 1 Act: 54 Avg: 159 Max: 8839
T: 3 (29759) P: 0 I:2500 C: 139346 Min: 1 Act: 52 Avg: 186 Max: 9498
T: 4 (29760) P: 0 I:3000 C: 117779 Min: 2 Act: 54 Avg: 172 Max: 8862
T: 5 (29761) P: 0 I:3500 C: 101272 Min: 1 Act: 54 Avg: 180 Max: 9331
T: 6 (29762) P: 0 I:4000 C: 88781 Min: 3 Act: 54 Avg: 208 Max: 7111
T: 7 (29763) P: 0 I:4500 C: 78986 Min: 1 Act: 143 Avg: 168 Max: 6677

homer:/root # cyclictest -Smq
# /dev/cpu_dma_latency set to 0us
T: 0 (29747) P: 0 I:1000 C: 354262 Min: 2 Act: 51 Avg: 65 Max: 9754
T: 1 (29748) P: 0 I:1500 C: 236885 Min: 1 Act: 43 Avg: 56 Max: 8434
T: 2 (29749) P: 0 I:2000 C: 177461 Min: 3 Act: 53 Avg: 75 Max: 9028
T: 3 (29750) P: 0 I:2500 C: 142315 Min: 2 Act: 53 Avg: 74 Max: 7654
T: 4 (29751) P: 0 I:3000 C: 118642 Min: 3 Act: 51 Avg: 78 Max: 8169
T: 5 (29752) P: 0 I:3500 C: 101833 Min: 3 Act: 52 Avg: 75 Max: 7790
T: 6 (29753) P: 0 I:4000 C: 89065 Min: 3 Act: 52 Avg: 76 Max: 8001
T: 7 (29754) P: 0 I:4500 C: 79323 Min: 3 Act: 54 Avg: 78 Max: 7703

2023-09-19 22:11:51

by Qais Yousef

Subject: Re: [PATCH 2/2] sched/eevdf: Use sched_attr::sched_runtime to set request/slice suggestion

On 09/15/23 14:43, [email protected] wrote:
> Allow applications to directly set a suggested request/slice length using
> sched_attr::sched_runtime.

I'm probably as eternally confused as ever, but is this going to be the latency
hint too? I find it hard to correlate runtime to latency if it is.

>
> The implementation clamps the value to: 0.1[ms] <= slice <= 100[ms]
> which is 1/10 the size of HZ=1000 and 10 times the size of HZ=100.
>
> Applications should strive to use their periodic runtime at a high
> confidence interval (95%+) as the target slice. Using a smaller slice
> will introduce undue preemptions, while using a larger value will
> increase latency.

I can see this being hard to use in practice. There's a portability issue in
defining a runtime that is universal across systems. The same workload will run
faster on some systems and slower on others. Applications can't just pick
a value and must do some training to discover the right value for a particular
system. Add to that the weird impact HMP and DVFS can have on runtime from
wakeup to wakeup, and things get harder. Shared DVFS policies especially, where
a task can suddenly find itself taking half the runtime because a busy task on
another CPU doubled its speed.

(slice is not invariant, right?)

And a 95%+ confidence will be hard. A task might not know for sure what it will
do all the time beforehand. There could be a strong correlation for a short
period of time, but the interactive nature of a lot of workloads makes this
hard to predict with such high confidence. And those transition events are
usually what the scheduler struggles to handle well. All history is suddenly
erased, and rebuilding it takes time; during which things get messy.

Binder tasks for example can be latency sensitive, but they're not periodic and
will be run on demand when someone asks for something. They're usually short
lived.

Actually, so far in Android we have just had the notion of something being
sensitive to wake-up latency, without the need to be specific about it. And if
a set of such tasks gets stuck on the same rq, they had better run first as
much as possible. We did find the need to implement something in the load
balancer to spread them, as oversubscription issues are unavoidable. I think
the misfit path is the best place to handle this, and I'd be happy to send
patches to this effect once we land some interface.

Of course you might find variations of this from different vendors with their
own SDKs for developers.

How do you see the proposed interface fitting into this picture? I can't see
how to use it, but maybe I didn't understand it. Assuming of course this is
indeed about latency :-)


Thanks!

--
Qais Yousef

2023-09-20 00:42:18

by Peter Zijlstra

Subject: Re: [PATCH 2/2] sched/eevdf: Use sched_attr::sched_runtime to set request/slice suggestion

On Tue, Sep 19, 2023 at 11:07:08PM +0100, Qais Yousef wrote:
> On 09/15/23 14:43, [email protected] wrote:
> > Allow applications to directly set a suggested request/slice length using
> > sched_attr::sched_runtime.
>
> I'm probably as eternally confused as ever, but is this going to be the latency
> hint too? I find it hard to correlate runtime to latency if it is.

Yes. Think of it as if a task has to save up for its slice. A shorter
slice means a shorter time to save up, which means it can run
sooner. A longer slice, you get to save up longer.

Some people really want longer slices to reduce cache thrashing or
held-lock-preemption like things. Oracle, Facebook, or virt thingies.

Other people just want very short activations but want them quickly.
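
To make that a little more concrete -- a rough sketch, not from the patch,
with made-up numbers -- the deadline rule vd_i = ve_i + r_i/w_i from the
fair.c hunk means a smaller request directly buys an earlier virtual
deadline:

#include <stdio.h>

int main(void)
{
	double ve = 0.0, w = 1.0;		/* same weight, eligible at the same time */
	double r_short = 0.1, r_long = 100.0;	/* requested slices, in ms */

	/* vd_i = ve_i + r_i/w_i */
	double vd_short = ve + r_short / w;	/*   0.1 */
	double vd_long  = ve + r_long  / w;	/* 100.0 */

	printf("vd_short=%.1f vd_long=%.1f -> the shorter request is picked first\n",
	       vd_short, vd_long);
	return 0;
}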

2023-09-20 09:34:16

by Qais Yousef

Subject: Re: [PATCH 2/2] sched/eevdf: Use sched_attr::sched_runtime to set request/slice suggestion

On 09/20/23 00:37, Peter Zijlstra wrote:
> On Tue, Sep 19, 2023 at 11:07:08PM +0100, Qais Yousef wrote:
> > On 09/15/23 14:43, [email protected] wrote:
> > > Allow applications to directly set a suggested request/slice length using
> > > sched_attr::sched_runtime.
> >
> > I'm probably as eternally confused as ever, but is this going to be the latency
> > hint too? I find it hard to correlate runtime to latency if it is.
>
> Yes. Think of it as if a task has to save up for its slice. A shorter
> slice means a shorter time to save up, which means it can run
> sooner. A longer slice, you get to save up longer.

Okay, so bias toward latency (short runtime) or throughput (long runtime).

I revisited the paper and can appreciate the importance of the term 'request'
in here.

Is the 95%+ confidence part really mandatory? I can easily see runtime swinging
between 2-4ms over a trace, for example. Should this task request 4ms as its
runtime then? If we request 2ms but end up needing 4ms, IIUC we'll be preempted
after 2ms since that's what we requested, right?

What is the penalty for lying if we request 4ms but end up needing 2ms?

> Some people really want longer slices to reduce cache thrashing or
> held-lock-preemption like things. Oracle, Facebook, or virt thingies.
>
> Other people just want very short activations but want them quickly.

Is 3-5ms in the very short region? I think that's the average I see. There are
longer, and shorter, but nothing 'crazy' long.

If we have a bunch of very short tasks stuck on the same rq, IIUC the ones that
actually requested the shorter slice should win, as the others will still have
sysctl_sched_base_slice as their request, hence their deadline will seem further
away in spite of them not consuming their full slice. And somehow lag will sort
itself out to ensure fairness if there were too many wake-ups of short-request
tasks (latency wakeup storm).

With this interface it'd be sort of compulsory for users to keep their
latency-sensitive tasks short, which maybe is a good thing. The question is how
short they have to be. Is there a number that can be exposed or
deduced/calculated to advise users what to stay within?

Silly question: do you think this interface is transferable if we move away
from EEVDF in the future for whatever reason? I feel I have to reason about how
EEVDF works to use it, which was probably my first stumbling point as I was
thinking in a more detached/abstract manner.

Sorry, too many questions..


Thanks!

--
Qais Yousef

2023-09-24 12:25:47

by Mike Galbraith

Subject: Re: [PATCH 2/2] sched/eevdf: Use sched_attr::sched_runtime to set request/slice suggestion

On Tue, 2023-09-19 at 09:53 +0200, Mike Galbraith wrote:
> On Fri, 2023-09-15 at 14:43 +0200, [email protected] wrote:
> > Allow applications to directly set a suggested request/slice length using
> > sched_attr::sched_runtime.
>
> I met an oddity while fiddling to see what a custom slice would do for
> cyclictest, it seeming to be a reasonable target...

For the record, that cyclictest oddity was the mixed slice handling
improvement patch from the latency-nice branch not doing so well at the
ultralight end of the spectrum. Turning the feature off eliminated it.

Some numbers for the terminally bored below.

-Mike

5 minutes of repeatable mixed load, a Blender 1920x1080@24 YouTube clip
(no annoying ads) vs massive_intr (4x88% hogs), with cyclictest -D 300
doing the profile duration timing, in a dinky/cute rpi4. Filenames
describe slice and feature settings; 'fudge' in a filename means the
feature was turned on.

perf sched lat -i perf.data.full.6.6.0-rc2-v8.stock --sort=runtime -S 15 -T

----------------------------------------------------------------------------------------------------------
Task | Runtime ms | Switches | Avg delay ms | Max delay ms | Sum delay ms |
----------------------------------------------------------------------------------------------------------
massive_intr:(5) | 816065.611 ms | 917839 | avg: 0.255 ms | max: 42.579 ms | sum:234075.309 ms |
chromium-browse:(16) | 59428.589 ms | 138489 | avg: 0.507 ms | max: 28.546 ms | sum:70262.946 ms |
ThreadPoolForeg:(49) | 32962.888 ms | 57147 | avg: 0.916 ms | max: 35.043 ms | sum:52354.576 ms |
mutter:1352 | 24910.265 ms | 52058 | avg: 0.556 ms | max: 21.166 ms | sum:28945.039 ms |
Chrome_ChildIOT:(14) | 23785.517 ms | 132307 | avg: 0.345 ms | max: 30.987 ms | sum:45657.621 ms |
VizCompositorTh:30768 | 14985.421 ms | 64769 | avg: 0.432 ms | max: 24.620 ms | sum:27981.948 ms |
Xorg:925 | 14802.426 ms | 67407 | avg: 0.339 ms | max: 23.912 ms | sum:22844.860 ms |
alsa-sink-MAI P:1260 | 13958.874 ms | 33127 | avg: 0.023 ms | max: 15.023 ms | sum: 756.454 ms |
cyclictest:(5) | 13345.073 ms | 715171 | avg: 0.271 ms | max: 19.277 ms | sum:193861.885 ms |
Media:30834 | 12627.366 ms | 64061 | avg: 0.339 ms | max: 29.811 ms | sum:21687.561 ms |
ThreadPoolSingl:(6) | 9254.163 ms | 43524 | avg: 0.405 ms | max: 21.251 ms | sum:17617.750 ms |
V4L2DecoderThre:30887 | 9251.235 ms | 63002 | avg: 0.302 ms | max: 16.819 ms | sum:19000.280 ms |
VideoFrameCompo:30836 | 7947.943 ms | 47459 | avg: 0.300 ms | max: 19.666 ms | sum:14254.378 ms |
pulseaudio:1172 | 6638.018 ms | 43467 | avg: 0.219 ms | max: 14.621 ms | sum: 9535.951 ms |
threaded-ml:30883 | 5358.744 ms | 29193 | avg: 0.349 ms | max: 12.893 ms | sum:10175.289 ms |
----------------------------------------------------------------------------------------------------------
TOTAL: |1109784.680 ms | 3029507 | | 42.579 ms | 845307.017 ms |
----------------------------------------------------------------------------------------------------------
INFO: 0.001% context switch bugs (13 out of 1909170)

perf sched lat -i perf.data.full.6.6.0-rc2-v8.massive_intr-100ms-slice --sort=runtime -S 15 -T

----------------------------------------------------------------------------------------------------------
Task | Runtime ms | Switches | Avg delay ms | Max delay ms | Sum delay ms |
----------------------------------------------------------------------------------------------------------
massive_intr:(5) | 861639.948 ms | 812113 | avg: 0.277 ms | max: 138.492 ms | sum:224707.956 ms |
chromium-browse:(8) | 51307.141 ms | 125222 | avg: 0.530 ms | max: 42.935 ms | sum:66314.290 ms |
ThreadPoolForeg:(16) | 26421.979 ms | 45000 | avg: 0.873 ms | max: 35.752 ms | sum:39306.544 ms |
Chrome_ChildIOT:(5) | 22542.183 ms | 131172 | avg: 0.336 ms | max: 51.751 ms | sum:44091.954 ms |
mutter:1352 | 21835.334 ms | 48508 | avg: 0.566 ms | max: 37.495 ms | sum:27446.519 ms |
VizCompositorTh:39048 | 14531.018 ms | 63787 | avg: 0.463 ms | max: 56.326 ms | sum:29522.218 ms |
Xorg:925 | 14497.447 ms | 67315 | avg: 0.397 ms | max: 36.714 ms | sum:26735.175 ms |
alsa-sink-MAI P:1260 | 13935.472 ms | 33111 | avg: 0.020 ms | max: 6.753 ms | sum: 677.888 ms |
cyclictest:(5) | 12696.835 ms | 653111 | avg: 0.425 ms | max: 38.092 ms | sum:277440.622 ms |
Media:39089 | 12571.118 ms | 67187 | avg: 0.335 ms | max: 26.660 ms | sum:22498.438 ms |
V4L2DecoderThre:39125 | 9156.299 ms | 66378 | avg: 0.301 ms | max: 23.828 ms | sum:19991.504 ms |
ThreadPoolSingl:(4) | 9079.291 ms | 46187 | avg: 0.377 ms | max: 28.850 ms | sum:17422.535 ms |
VideoFrameCompo:39091 | 8103.756 ms | 50025 | avg: 0.290 ms | max: 33.230 ms | sum:14518.688 ms |
pulseaudio:1172 | 6575.897 ms | 44952 | avg: 0.259 ms | max: 19.937 ms | sum:11630.128 ms |
threaded-ml:39123 | 5367.921 ms | 29503 | avg: 0.339 ms | max: 24.648 ms | sum: 9993.313 ms |
----------------------------------------------------------------------------------------------------------
TOTAL: |1127978.646 ms | 2802143 | | 138.492 ms | 920177.765 ms |
----------------------------------------------------------------------------------------------------------
INFO: 0.000% context switch bugs (8 out of 1773282)

perf sched lat -i perf.data.full.6.6.0-rc2-v8.massive_intr-100ms-slice-fudge --sort=runtime -S 15 -T

----------------------------------------------------------------------------------------------------------
Task | Runtime ms | Switches | Avg delay ms | Max delay ms | Sum delay ms |
----------------------------------------------------------------------------------------------------------
massive_intr:(5) | 855426.748 ms | 777984 | avg: 0.266 ms | max: 111.756 ms | sum:206723.857 ms |
chromium-browse:(8) | 51656.331 ms | 122631 | avg: 0.531 ms | max: 36.305 ms | sum:65094.280 ms |
ThreadPoolForeg:(16) | 27473.845 ms | 43053 | avg: 0.972 ms | max: 34.973 ms | sum:41833.237 ms |
mutter:1352 | 21412.313 ms | 47476 | avg: 0.541 ms | max: 33.553 ms | sum:25685.892 ms |
Chrome_ChildIOT:(5) | 20283.623 ms | 119424 | avg: 0.395 ms | max: 31.266 ms | sum:47164.449 ms |
VizCompositorTh:36026 | 14643.428 ms | 63979 | avg: 0.464 ms | max: 34.832 ms | sum:29708.794 ms |
Xorg:925 | 14296.586 ms | 67756 | avg: 0.410 ms | max: 23.774 ms | sum:27811.107 ms |
alsa-sink-MAI P:1260 | 13977.823 ms | 33116 | avg: 0.023 ms | max: 5.513 ms | sum: 750.004 ms |
cyclictest:(5) | 12365.030 ms | 645151 | avg: 0.475 ms | max: 35.084 ms | sum:306236.764 ms |
Media:36076 | 12256.872 ms | 60848 | avg: 0.378 ms | max: 26.714 ms | sum:22978.110 ms |
ThreadPoolSingl:(4) | 8983.939 ms | 43538 | avg: 0.401 ms | max: 21.137 ms | sum:17468.417 ms |
V4L2DecoderThre:36101 | 8910.124 ms | 58505 | avg: 0.316 ms | max: 22.654 ms | sum:18503.766 ms |
VideoFrameCompo:36081 | 7851.251 ms | 44655 | avg: 0.325 ms | max: 24.662 ms | sum:14532.619 ms |
pulseaudio:1172 | 6671.226 ms | 44172 | avg: 0.332 ms | max: 19.466 ms | sum:14673.304 ms |
threaded-ml:36099 | 5338.302 ms | 28879 | avg: 0.379 ms | max: 20.754 ms | sum:10944.051 ms |
----------------------------------------------------------------------------------------------------------
TOTAL: |1115349.047 ms | 2703902 | | 111.756 ms | 959507.323 ms |
----------------------------------------------------------------------------------------------------------
INFO: 0.000% context switch bugs (4 out of 1763882)

perf sched lat -i perf.data.full.6.6.0-rc2-v8.cyclictest-500us-slice --sort=runtime -S 15 -T

----------------------------------------------------------------------------------------------------------
Task | Runtime ms | Switches | Avg delay ms | Max delay ms | Sum delay ms |
----------------------------------------------------------------------------------------------------------
massive_intr:(5) | 847672.515 ms | 951686 | avg: 0.233 ms | max: 30.660 ms | sum:222000.736 ms |
chromium-browse:(8) | 53628.292 ms | 133414 | avg: 0.456 ms | max: 30.414 ms | sum:60792.606 ms |
ThreadPoolForeg:(17) | 26765.405 ms | 43894 | avg: 0.869 ms | max: 43.268 ms | sum:38131.378 ms |
mutter:1352 | 22161.746 ms | 49865 | avg: 0.514 ms | max: 24.773 ms | sum:25639.654 ms |
Chrome_ChildIOT:(7) | 20878.044 ms | 122989 | avg: 0.315 ms | max: 44.034 ms | sum:38735.550 ms |
Xorg:925 | 14647.685 ms | 66766 | avg: 0.326 ms | max: 16.256 ms | sum:21785.616 ms |
VizCompositorTh:34571 | 14304.582 ms | 64222 | avg: 0.401 ms | max: 23.334 ms | sum:25767.033 ms |
alsa-sink-MAI P:1260 | 14006.042 ms | 33136 | avg: 0.022 ms | max: 4.851 ms | sum: 724.564 ms |
cyclictest:(5) | 13404.228 ms | 731243 | avg: 0.217 ms | max: 28.997 ms | sum:158740.145 ms |
Media:34626 | 12790.442 ms | 63825 | avg: 0.327 ms | max: 30.721 ms | sum:20853.022 ms |
V4L2DecoderThre:34651 | 9278.538 ms | 62619 | avg: 0.284 ms | max: 18.678 ms | sum:17761.070 ms |
ThreadPoolSingl:(4) | 9226.802 ms | 42846 | avg: 0.389 ms | max: 18.563 ms | sum:16684.938 ms |
VideoFrameCompo:34627 | 7788.047 ms | 46681 | avg: 0.285 ms | max: 14.102 ms | sum:13327.258 ms |
pulseaudio:1172 | 6643.612 ms | 42393 | avg: 0.186 ms | max: 7.567 ms | sum: 7873.616 ms |
threaded-ml:34649 | 5402.911 ms | 28737 | avg: 0.313 ms | max: 13.276 ms | sum: 9007.151 ms |
----------------------------------------------------------------------------------------------------------
TOTAL: |1114449.900 ms | 3009626 | | 44.034 ms | 740410.836 ms |
----------------------------------------------------------------------------------------------------------
INFO: 0.000% context switch bugs (8 out of 1873597)

perf sched lat -i perf.data.full.6.6.0-rc2-v8.cyclictest-500us-slice-fudge --sort=runtime -S 15 -T

----------------------------------------------------------------------------------------------------------
Task | Runtime ms | Switches | Avg delay ms | Max delay ms | Sum delay ms |
----------------------------------------------------------------------------------------------------------
massive_intr:(5) | 838981.090 ms | 880877 | avg: 0.262 ms | max: 28.559 ms | sum:230982.334 ms |
chromium-browse:(9) | 55497.942 ms | 128916 | avg: 0.520 ms | max: 39.331 ms | sum:67042.536 ms |
ThreadPoolForeg:(20) | 27669.981 ms | 41272 | avg: 0.990 ms | max: 29.775 ms | sum:40857.370 ms |
mutter:1352 | 22455.332 ms | 47362 | avg: 0.562 ms | max: 19.838 ms | sum:26611.559 ms |
Chrome_ChildIOT:(7) | 22295.707 ms | 123559 | avg: 0.344 ms | max: 29.449 ms | sum:42479.505 ms |
Xorg:925 | 14894.399 ms | 65930 | avg: 0.353 ms | max: 17.650 ms | sum:23303.557 ms |
VizCompositorTh:37170 | 14567.478 ms | 62477 | avg: 0.438 ms | max: 25.008 ms | sum:27366.073 ms |
alsa-sink-MAI P:1260 | 14207.866 ms | 33134 | avg: 0.022 ms | max: 4.092 ms | sum: 744.586 ms |
cyclictest:(5) | 13483.280 ms | 697504 | avg: 0.375 ms | max: 20.859 ms | sum:261834.795 ms |
Media:37224 | 12890.016 ms | 62641 | avg: 0.343 ms | max: 14.333 ms | sum:21510.222 ms |
ThreadPoolSingl:(4) | 9095.635 ms | 42121 | avg: 0.383 ms | max: 17.168 ms | sum:16116.408 ms |
V4L2DecoderThre:37261 | 9079.051 ms | 63220 | avg: 0.291 ms | max: 18.144 ms | sum:18421.175 ms |
VideoFrameCompo:37226 | 8049.344 ms | 46145 | avg: 0.309 ms | max: 18.575 ms | sum:14252.577 ms |
pulseaudio:1172 | 6767.263 ms | 42736 | avg: 0.205 ms | max: 8.202 ms | sum: 8781.668 ms |
threaded-ml:37262 | 5404.490 ms | 28428 | avg: 0.332 ms | max: 17.758 ms | sum: 9424.566 ms |
----------------------------------------------------------------------------------------------------------
TOTAL: |1115092.327 ms | 2896377 | | 39.331 ms | 881443.337 ms |
----------------------------------------------------------------------------------------------------------
INFO: 0.001% context switch bugs (12 out of 1845612)


Peeks out window at lovely sunny Sunday, and <poof>

2023-12-10 23:20:27

by Qais Yousef

Subject: Re: [PATCH 2/2] sched/eevdf: Use sched_attr::sched_runtime to set request/slice suggestion

On 09/20/23 00:37, Peter Zijlstra wrote:
> On Tue, Sep 19, 2023 at 11:07:08PM +0100, Qais Yousef wrote:
> > On 09/15/23 14:43, [email protected] wrote:
> > > Allow applications to directly set a suggested request/slice length using
> > > sched_attr::sched_runtime.
> >
> > I'm probably as eternally confused as ever, but is this going to be the latency
> > hint too? I find it hard to correlate runtime to latency if it is.
>
> Yes. Think of it as if a task has to save up for its slice. A shorter
> slice means a shorter time to save up, which means it can run
> sooner. A longer slice, you get to save up longer.
>
> Some people really want longer slices to reduce cache thrashing or
> held-lock-preemption like things. Oracle, Facebook, or virt thingies.
>
> Other people just want very short activations but want them quickly.

I did check with several folks around here in the Android world, and none of
us can see how we can use this interface in practice.

It is helpful for those who have a specific system and workload they want to
tune together. But as a generic app-developer interface it will be impossible
to use.

Is that sched-qos thingy worth trying to pursue as an alternative for app
developers? I think from their perspective all they can practically say is
whether they care about running ASAP or not; so a boolean flag to signal the
desire for short wake-up latency. How to implement that would be my pain. But
do you see an issue in principle with trying to go down that route and seeing
how far I (we, if anyone else is interested) can get?

I think the two can co-exist each serving a different purpose.

Or is there something about this interface that makes it usable in this
manner that I couldn't get?


Thanks!

--
Qais Yousef