2023-09-15 13:55:28

by Peter Zijlstra

Subject: [PATCH 2/2] sched/eevdf: Use sched_attr::sched_runtime to set request/slice suggestion

Allow applications to directly set a suggested request/slice length using
sched_attr::sched_runtime.

The implementation clamps the value to: 0.1[ms] <= slice <= 100[ms]
that is, 1/10th of the tick period at HZ=1000 and 10 times the tick period at HZ=100.

Applications should strive to use their periodic runtime at a high
confidence level (95%+) as the target slice. Using a smaller slice
will introduce undue preemptions, while using a larger value will
increase latency.
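
For illustration only (not part of this patch): a minimal userspace sketch
of how an application could suggest a slice for itself via sched_setattr().
The 2ms value and the helper name are made up; error handling is elided.

#define _GNU_SOURCE
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>

/* glibc has no wrapper; the usual trick is to carry the UAPI layout here. */
struct sched_attr {
	uint32_t size;
	uint32_t sched_policy;
	uint64_t sched_flags;
	int32_t  sched_nice;
	uint32_t sched_priority;
	uint64_t sched_runtime;		/* request/slice suggestion, in ns */
	uint64_t sched_deadline;
	uint64_t sched_period;
};

/* Suggest a slice of @ns nanoseconds for the calling thread. */
static int set_slice_ns(uint64_t ns)
{
	struct sched_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.sched_policy = 0;		/* SCHED_OTHER */
	attr.sched_runtime = ns;	/* kernel clamps to [0.1ms, 100ms] */

	return syscall(SYS_sched_setattr, 0, &attr, 0);
}

/* e.g. set_slice_ns(2 * 1000 * 1000); -- ~95th percentile of the per-period work */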

For all the following examples assume a scheduling quantum of 8, and for
consistency all examples have W=4:

{A,B,C,D}(w=1,r=8):

ABCD...
+---+---+---+---

t=0, V=1.5 t=1, V=3.5
A |------< A |------<
B |------< B |------<
C |------< C |------<
D |------< D |------<
---+*------+-------+--- ---+--*----+-------+---

t=2, V=5.5 t=3, V=7.5
A |------< A |------<
B |------< B |------<
C |------< C |------<
D |------< D |------<
---+----*--+-------+--- ---+------*+-------+---

Note: 4 identical tasks in FIFO order

~~~

{A,B}(w=1,r=16) C(w=2,r=16)

AACCBBCC...
+---+---+---+---

t=0, V=1.25 t=2, V=5.25
A |--------------< A |--------------<
B |--------------< B |--------------<
C |------< C |------<
---+*------+-------+--- ---+----*--+-------+---

t=4, V=8.25 t=6, V=12.25
A |--------------< A |--------------<
B |--------------< B |--------------<
C |------< C |------<
---+-------*-------+--- ---+-------+---*---+---

Note: 1 heavy task -- because q=8, double r such that the deadline of the w=2
task doesn't go below q.

Note: observe the full schedule becomes: W*max(r_i/w_i) = 4*2q = 8q in length.

Note: the period of the heavy task is half the full period at:
W*(r_i/w_i) = 4*(2q/2) = 4q

~~~

{A,C,D}(w=1,r=16) B(w=1,r=8):

BAACCBDD...
+---+---+---+---

t=0, V=1.5 t=1, V=3.5
A |--------------< A |---------------<
B |------< B |------<
C |--------------< C |--------------<
D |--------------< D |--------------<
---+*------+-------+--- ---+--*----+-------+---

t=3, V=7.5 t=5, V=11.5
A |---------------< A |---------------<
B |------< B |------<
C |--------------< C |--------------<
D |--------------< D |--------------<
---+------*+-------+--- ---+-------+--*----+---

t=6, V=13.5
A |---------------<
B |------<
C |--------------<
D |--------------<
---+-------+----*--+---

Note: 1 short task -- again double r so that the deadline of the short task
won't be below q. Made B short because it's not the leftmost task, but it is
eligible with the 0,1,2,3 spread.

Note: like with the heavy task, the period of the short task observes:
W*(r_i/w_i) = 4*(1q/1) = 4q

~~~

A(w=1,r=16) B(w=1,r=8) C(w=2,r=16)

BCCAABCC...
+---+---+---+---

t=0, V=1.25 t=1, V=3.25
A |--------------< A |--------------<
B |------< B |------<
C |------< C |------<
---+*------+-------+--- ---+--*----+-------+---

t=3, V=7.25 t=5, V=11.25
A |--------------< A |--------------<
B |------< B |------<
C |------< C |------<
---+------*+-------+--- ---+-------+--*----+---

t=6, V=13.25
A |--------------<
B |------<
C |------<
---+-------+----*--+---

Note: 1 heavy and 1 short task -- combine them all.

Note: both the short and heavy task end up with a period of 4q

~~~

A(w=1,r=16) B(w=2,r=16) C(w=1,r=8)

BBCAABBC...
+---+---+---+---

t=0, V=1 t=2, V=5
A |--------------< A |--------------<
B |------< B |------<
C |------< C |------<
---+*------+-------+--- ---+----*--+-------+---

t=3, V=7 t=5, V=11
A |--------------< A |--------------<
B |------< B |------<
C |------< C |------<
---+------*+-------+--- ---+-------+--*----+---

t=7, V=15
A |--------------<
B |------<
C |------<
---+-------+------*+---

Note: as before but permuted

~~~

From all this it can be deduced that, for the steady state:

- the total period (P) of a schedule is: W*max(r_i/w_i)
- the average period of a task is: W*(r_i/w_i)
- each task obtains the fair share: w_i/W of each full period P
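
As a quick sanity check of these relations (illustration only, not part of
the patch), plugging in the earlier {A,B}(w=1,r=16) C(w=2,r=16) example:

#include <stdio.h>

int main(void)
{
	const int q = 8;			/* scheduling quantum, as above */
	const int w[] = { 1, 1, 2 };		/* weights of A, B, C */
	const int r[] = { 2*q, 2*q, 2*q };	/* requests of A, B, C */
	int n = 3, W = 0, max_rw = 0;

	for (int i = 0; i < n; i++) {
		W += w[i];
		if (r[i] / w[i] > max_rw)
			max_rw = r[i] / w[i];
	}

	/* total period: W*max(r_i/w_i) = 4*2q = 8q = 64 */
	printf("P = %d\n", W * max_rw);

	/* per-task period W*(r_i/w_i): 8q for A and B, 4q for C */
	for (int i = 0; i < n; i++)
		printf("task %d: period = %d, share = %d/%d of P\n",
		       i, W * (r[i] / w[i]), w[i], W);
	return 0;
}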

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
---
include/linux/sched.h | 3 +++
kernel/sched/core.c | 33 ++++++++++++++++++++++++++-------
kernel/sched/fair.c | 6 ++++--
3 files changed, 33 insertions(+), 9 deletions(-)

Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -555,6 +555,9 @@ struct sched_entity {
struct list_head group_node;
unsigned int on_rq;

+ unsigned int custom_slice : 1;
+ /* 31 bits hole */
+
u64 exec_start;
u64 sum_exec_runtime;
u64 prev_sum_exec_runtime;
Index: linux-2.6/kernel/sched/core.c
===================================================================
--- linux-2.6.orig/kernel/sched/core.c
+++ linux-2.6/kernel/sched/core.c
@@ -4501,7 +4501,6 @@ static void __sched_fork(unsigned long c
p->se.nr_migrations = 0;
p->se.vruntime = 0;
p->se.vlag = 0;
- p->se.slice = sysctl_sched_base_slice;
INIT_LIST_HEAD(&p->se.group_node);

#ifdef CONFIG_FAIR_GROUP_SCHED
@@ -4755,6 +4754,8 @@ int sched_fork(unsigned long clone_flags

p->prio = p->normal_prio = p->static_prio;
set_load_weight(p, false);
+ p->se.custom_slice = 0;
+ p->se.slice = sysctl_sched_base_slice;

/*
* We don't need the reset flag anymore after the fork. It has
@@ -7552,10 +7553,20 @@ static void __setscheduler_params(struct

p->policy = policy;

- if (dl_policy(policy))
+ if (dl_policy(policy)) {
__setparam_dl(p, attr);
- else if (fair_policy(policy))
+ } else if (fair_policy(policy)) {
p->static_prio = NICE_TO_PRIO(attr->sched_nice);
+ if (attr->sched_runtime) {
+ p->se.custom_slice = 1;
+ p->se.slice = clamp_t(u64, attr->sched_runtime,
+ NSEC_PER_MSEC/10, /* HZ=1000 * 10 */
+ NSEC_PER_MSEC*100); /* HZ=100 / 10 */
+ } else {
+ p->se.custom_slice = 0;
+ p->se.slice = sysctl_sched_base_slice;
+ }
+ }

/*
* __sched_setscheduler() ensures attr->sched_priority == 0 when
@@ -7740,7 +7751,9 @@ recheck:
* but store a possible modification of reset_on_fork.
*/
if (unlikely(policy == p->policy)) {
- if (fair_policy(policy) && attr->sched_nice != task_nice(p))
+ if (fair_policy(policy) &&
+ (attr->sched_nice != task_nice(p) ||
+ (attr->sched_runtime && attr->sched_runtime != p->se.slice)))
goto change;
if (rt_policy(policy) && attr->sched_priority != p->rt_priority)
goto change;
@@ -7886,6 +7899,9 @@ static int _sched_setscheduler(struct ta
.sched_nice = PRIO_TO_NICE(p->static_prio),
};

+ if (p->se.custom_slice)
+ attr.sched_runtime = p->se.slice;
+
/* Fixup the legacy SCHED_RESET_ON_FORK hack. */
if ((policy != SETPARAM_POLICY) && (policy & SCHED_RESET_ON_FORK)) {
attr.sched_flags |= SCHED_FLAG_RESET_ON_FORK;
@@ -8062,12 +8078,14 @@ err_size:

static void get_params(struct task_struct *p, struct sched_attr *attr)
{
- if (task_has_dl_policy(p))
+ if (task_has_dl_policy(p)) {
__getparam_dl(p, attr);
- else if (task_has_rt_policy(p))
+ } else if (task_has_rt_policy(p)) {
attr->sched_priority = p->rt_priority;
- else
+ } else {
attr->sched_nice = task_nice(p);
+ attr->sched_runtime = p->se.slice;
+ }
}

/**
@@ -10086,6 +10104,7 @@ void __init sched_init(void)
}

set_load_weight(&init_task, false);
+ init_task.se.slice = sysctl_sched_base_slice;

/*
* The boot idle thread does lazy MMU switching as well:
Index: linux-2.6/kernel/sched/fair.c
===================================================================
--- linux-2.6.orig/kernel/sched/fair.c
+++ linux-2.6/kernel/sched/fair.c
@@ -974,7 +974,8 @@ static void update_deadline(struct cfs_r
* nice) while the request time r_i is determined by
* sysctl_sched_base_slice.
*/
- se->slice = sysctl_sched_base_slice;
+ if (!se->custom_slice)
+ se->slice = sysctl_sched_base_slice;

/*
* EEVDF: vd_i = ve_i + r_i / w_i
@@ -4922,7 +4923,8 @@ place_entity(struct cfs_rq *cfs_rq, stru
u64 vslice, vruntime = avg_vruntime(cfs_rq);
s64 lag = 0;

- se->slice = sysctl_sched_base_slice;
+ if (!se->custom_slice)
+ se->slice = sysctl_sched_base_slice;
vslice = calc_delta_fair(se->slice, se);

/*



2023-09-19 16:44:52

by Mike Galbraith

Subject: Re: [PATCH 2/2] sched/eevdf: Use sched_attr::sched_runtime to set request/slice suggestion

On Fri, 2023-09-15 at 14:43 +0200, [email protected] wrote:
> Allow applications to directly set a suggested request/slice length using
> sched_attr::sched_runtime.

I met an oddity while fiddling to see what a custom slice would do for
cyclictest, it seeming to be a reasonable target. $subject seems to be
working fine per a live peek with crash, the caveat being it's not alone in
otherwise virgin source...

homer:..kernel/linux-tip # make -j8 vmlinux && killall cyclictest

Elsewhere, post build/kill...

instance-a, stock
homer:/root # cyclictest -Smq
# /dev/cpu_dma_latency set to 0us
T: 0 (17763) P: 0 I:1000 C: 297681 Min: 2 Act: 51 Avg: 66 Max: 10482
T: 1 (17764) P: 0 I:1500 C: 198639 Min: 2 Act: 3159 Avg: 70 Max: 9054
T: 2 (17765) P: 0 I:2000 C: 149075 Min: 2 Act: 52 Avg: 78 Max: 8190
T: 3 (17766) P: 0 I:2500 C: 119867 Min: 2 Act: 55 Avg: 77 Max: 8328
T: 4 (17767) P: 0 I:3000 C: 99888 Min: 2 Act: 51 Avg: 90 Max: 8483
T: 5 (17768) P: 0 I:3500 C: 85748 Min: 3 Act: 53 Avg: 76 Max: 8148
T: 6 (17769) P: 0 I:4000 C: 75153 Min: 3 Act: 53 Avg: 92 Max: 7510
T: 7 (17770) P: 0 I:4500 C: 66807 Min: 3 Act: 55 Avg: 81 Max: 8709

instance-b, launched w. custom slice, and verified via live peek with crash.
homer:/root # schedtool -v -s 100000 -e cyclictest -Smq
PID 17753: PRIO 0, POLICY N: SCHED_NORMAL , NICE 0, EEVDF_NICE 0, EEVDF_SLICE 100000, AFFINITY 0xff
# /dev/cpu_dma_latency set to 0us
T: 0 (17754) P: 0 I:1000 C: 297014 Min: 1 Act: 51 Avg: 79 Max: 9584
T: 1 (17755) P: 0 I:1500 C: 198401 Min: 1 Act: 3845 Avg: 118 Max: 9995
T: 2 (17756) P: 0 I:2000 C: 149137 Min: 1 Act: 103 Avg: 125 Max: 8863
T: 3 (17757) P: 0 I:2500 C: 119519 Min: 1 Act: 52 Avg: 218 Max: 9704
T: 4 (17758) P: 0 I:3000 C: 99760 Min: 1 Act: 51 Avg: 134 Max: 11108
T: 5 (17759) P: 0 I:3500 C: 85731 Min: 1 Act: 53 Avg: 234 Max: 9649
T: 6 (17760) P: 0 I:4000 C: 75321 Min: 2 Act: 53 Avg: 139 Max: 7351
T: 7 (17761) P: 0 I:4500 C: 66929 Min: 3 Act: 51 Avg: 191 Max: 6469
^^^ hmm
Those Avg: numbers follow the custom slice.

homer:/root # schedtool -v -s 500000 -e cyclictest -Smq
PID 29755: PRIO 0, POLICY N: SCHED_NORMAL , NICE 0, EEVDF_NICE 0, EEVDF_SLICE 500000, AFFINITY 0xff
# /dev/cpu_dma_latency set to 0us
T: 0 (29756) P: 0 I:1000 C: 352348 Min: 1 Act: 51 Avg: 67 Max: 10259
T: 1 (29757) P: 0 I:1500 C: 229618 Min: 1 Act: 59 Avg: 121 Max: 8439
T: 2 (29758) P: 0 I:2000 C: 176031 Min: 1 Act: 54 Avg: 159 Max: 8839
T: 3 (29759) P: 0 I:2500 C: 139346 Min: 1 Act: 52 Avg: 186 Max: 9498
T: 4 (29760) P: 0 I:3000 C: 117779 Min: 2 Act: 54 Avg: 172 Max: 8862
T: 5 (29761) P: 0 I:3500 C: 101272 Min: 1 Act: 54 Avg: 180 Max: 9331
T: 6 (29762) P: 0 I:4000 C: 88781 Min: 3 Act: 54 Avg: 208 Max: 7111
T: 7 (29763) P: 0 I:4500 C: 78986 Min: 1 Act: 143 Avg: 168 Max: 6677

homer:/root # cyclictest -Smq
# /dev/cpu_dma_latency set to 0us
T: 0 (29747) P: 0 I:1000 C: 354262 Min: 2 Act: 51 Avg: 65 Max: 9754
T: 1 (29748) P: 0 I:1500 C: 236885 Min: 1 Act: 43 Avg: 56 Max: 8434
T: 2 (29749) P: 0 I:2000 C: 177461 Min: 3 Act: 53 Avg: 75 Max: 9028
T: 3 (29750) P: 0 I:2500 C: 142315 Min: 2 Act: 53 Avg: 74 Max: 7654
T: 4 (29751) P: 0 I:3000 C: 118642 Min: 3 Act: 51 Avg: 78 Max: 8169
T: 5 (29752) P: 0 I:3500 C: 101833 Min: 3 Act: 52 Avg: 75 Max: 7790
T: 6 (29753) P: 0 I:4000 C: 89065 Min: 3 Act: 52 Avg: 76 Max: 8001
T: 7 (29754) P: 0 I:4500 C: 79323 Min: 3 Act: 54 Avg: 78 Max: 7703

2023-09-19 22:11:51

by Qais Yousef

Subject: Re: [PATCH 2/2] sched/eevdf: Use sched_attr::sched_runtime to set request/slice suggestion

On 09/15/23 14:43, [email protected] wrote:
> Allow applications to directly set a suggested request/slice length using
> sched_attr::sched_runtime.

I'm probably as eternally confused as ever, but is this going to be the latency
hint too? I find it hard to correlate runtime to latency if it is.

>
> The implementation clamps the value to: 0.1[ms] <= slice <= 100[ms]
> which is 1/10 the size of HZ=1000 and 10 times the size of HZ=100.
>
> Applications should strive to use their periodic runtime at a high
> confidence interval (95%+) as the target slice. Using a smaller slice
> will introduce undue preemptions, while using a larger value will
> increase latency.

I can see this being hard to use in practice. There's a portability issue in
defining a runtime that is universal across systems. The same workload will run
faster on some systems and slower on others. Applications can't just pick
a value and must do some training to discover the right value for a particular
system. Add to that the weird impact HMP and DVFS can have on runtime from
wakeup to wakeup, and things get harder. Shared DVFS policies especially, where
a task can suddenly find itself taking half the runtime because a busy task on
another CPU doubled its speed.

(slice is not invariant, right?)

And a 95%+ confidence will be hard. A task might not know for sure what it will
do all the time beforehand. There could be a strong correlation for a short
period of time, but the interactive nature of a lot of workloads makes this
hard to predict with such high confidence. And those transition events are
usually what the scheduler struggles to handle well. All history is suddenly
erased, and rebuilding it takes time; during which things get messy.

Binder tasks for example can be latency sensitive, but they're not periodic and
will be run on demand when someone asks for something. They're usually short
lived.

Actually, so far in Android we have just had the notion of something being
sensitive to wake-up latency, without the need to be specific about it. And if
a set of such tasks gets stuck on the same rq, they had better run first as
much as possible. We did find the need to implement something in the load
balancer to spread them, as oversubscription issues are unavoidable. I think
the misfit path is the best place to handle this, and I'd be happy to send
patches to this effect once we land some interface.

Of course you might find variations of this from different vendors with their
own SDKs for developers.

How do you see the proposed interface fitting into this picture? I can't see
how to use it, but maybe I didn't understand it. Assuming of course this is
indeed about latency :-)


Thanks!

--
Qais Yousef

2023-09-20 00:42:18

by Peter Zijlstra

Subject: Re: [PATCH 2/2] sched/eevdf: Use sched_attr::sched_runtime to set request/slice suggestion

On Tue, Sep 19, 2023 at 11:07:08PM +0100, Qais Yousef wrote:
> On 09/15/23 14:43, [email protected] wrote:
> > Allow applications to directly set a suggested request/slice length using
> > sched_attr::sched_runtime.
>
> I'm probably as eternally confused as ever, but is this going to be the latency
> hint too? I find it hard to correlate runtime to latency if it is.

Yes. Think of it as if a task has to save up for its slice. A shorter
slice means a shorter time to save up, which means it can run
sooner. A longer slice, you get to save up longer.

Some people really want longer slices to reduce cache thrashing or
held-lock-preemption like things. Oracle, Facebook, or virt thingies.

Other people just want very short activations but want them quickly.
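
To make that a little more concrete -- a rough sketch, not from the patch,
with made-up numbers -- the deadline rule vd_i = ve_i + r_i/w_i from the
fair.c hunk means a smaller request directly buys an earlier virtual
deadline:

#include <stdio.h>

int main(void)
{
	double ve = 0.0, w = 1.0;		/* same weight, eligible at the same time */
	double r_short = 0.1, r_long = 100.0;	/* requested slices, in ms */

	/* vd_i = ve_i + r_i/w_i */
	double vd_short = ve + r_short / w;	/*   0.1 */
	double vd_long  = ve + r_long  / w;	/* 100.0 */

	printf("vd_short=%.1f vd_long=%.1f -> the shorter request is picked first\n",
	       vd_short, vd_long);
	return 0;
}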

2023-09-20 09:34:16

by Qais Yousef

Subject: Re: [PATCH 2/2] sched/eevdf: Use sched_attr::sched_runtime to set request/slice suggestion

On 09/20/23 00:37, Peter Zijlstra wrote:
> On Tue, Sep 19, 2023 at 11:07:08PM +0100, Qais Yousef wrote:
> > On 09/15/23 14:43, [email protected] wrote:
> > > Allow applications to directly set a suggested request/slice length using
> > > sched_attr::sched_runtime.
> >
> > I'm probably as eternally confused as ever, but is this going to be the latency
> > hint too? I find it hard to correlate runtime to latency if it is.
>
> Yes. Think of it as if a task has to save up for its slice. A shorter
> slice means a shorter time to save up, which means it can run
> sooner. A longer slice, you get to save up longer.

Okay, so bias toward latency (short runtime) or throughput (long runtime).

I revisited the paper and can appreciate the importance of the term 'request'
in here.

Is the 95%+ confidence part really mandatory? I can easily see runtime swinging
between 2-4ms over a trace, for example. Should this task request 4ms as its
runtime then? If we request 2ms but end up needing 4ms, IIUC we'll be preempted
after 2ms since that's what we requested, right?

What is the penalty for lying if we request 4ms but end up needing 2ms?

> Some people really want longer slices to reduce cache thrashing or
> held-lock-preemption like things. Oracle, Facebook, or virt thingies.
>
> Other people just want very short activations but want them quickly.

Is 3-5ms in the very short region? I think that's the average I see. There are
longer, and shorter, but nothing 'crazy' long.

If we have a bunch of very short tasks stuck on the same rq, IIUC the ones that
actually requested the shorter slice should win, as the others will still have
sysctl_sched_base_slice as their request, hence their deadline will seem further
away in spite of them not consuming their full slice. And somehow lag will sort
itself out to ensure fairness if there were too many wake-ups of short-request
tasks (latency wakeup storm).

With this interface it'd be sort of compulsory for users to keep their
latency-sensitive tasks short, which maybe is a good thing. The question is how
short they have to be. Is there a number that can be exposed or
deduced/calculated to advise users what to stay within?

Silly question: do you think this interface is transferable if we move away
from EEVDF in the future for whatever reason? I feel I have to reason about how
EEVDF works to use it, which was probably my first stumbling point as I was
thinking in a more detached/abstract manner.

Sorry, too many questions..


Thanks!

--
Qais Yousef

2023-09-24 12:25:47

by Mike Galbraith

Subject: Re: [PATCH 2/2] sched/eevdf: Use sched_attr::sched_runtime to set request/slice suggestion

On Tue, 2023-09-19 at 09:53 +0200, Mike Galbraith wrote:
> On Fri, 2023-09-15 at 14:43 +0200, [email protected] wrote:
> > Allow applications to directly set a suggested request/slice length using
> > sched_attr::sched_runtime.
>
> I met an oddity while fiddling to see what a custom slice would do for
> cyclictest, it seeming to be a reasonable target...

For the record, that cyclictest oddity was the mixed slice handling
improvement patch from the latency-nice branch not doing so well at the
ultralight end of the spectrum. Turning the feature off eliminated it.

Some numbers for the terminally bored below.

-Mike

5 minutes of repeatable mixed load, a Blender 1920x1080@24 YouTube clip
(no annoying ads) vs massive_intr (4x88% hogs), with cyclictest -D 300
doing the profile duration timing, in a dinky/cute rpi4. Filenames
describe slice and feature settings; 'fudge' in a filename means the
feature was turned on.

perf sched lat -i perf.data.full.6.6.0-rc2-v8.stock --sort=runtime -S 15 -T

----------------------------------------------------------------------------------------------------------
Task | Runtime ms | Switches | Avg delay ms | Max delay ms | Sum delay ms |
----------------------------------------------------------------------------------------------------------
massive_intr:(5) | 816065.611 ms | 917839 | avg: 0.255 ms | max: 42.579 ms | sum:234075.309 ms |
chromium-browse:(16) | 59428.589 ms | 138489 | avg: 0.507 ms | max: 28.546 ms | sum:70262.946 ms |
ThreadPoolForeg:(49) | 32962.888 ms | 57147 | avg: 0.916 ms | max: 35.043 ms | sum:52354.576 ms |
mutter:1352 | 24910.265 ms | 52058 | avg: 0.556 ms | max: 21.166 ms | sum:28945.039 ms |
Chrome_ChildIOT:(14) | 23785.517 ms | 132307 | avg: 0.345 ms | max: 30.987 ms | sum:45657.621 ms |
VizCompositorTh:30768 | 14985.421 ms | 64769 | avg: 0.432 ms | max: 24.620 ms | sum:27981.948 ms |
Xorg:925 | 14802.426 ms | 67407 | avg: 0.339 ms | max: 23.912 ms | sum:22844.860 ms |
alsa-sink-MAI P:1260 | 13958.874 ms | 33127 | avg: 0.023 ms | max: 15.023 ms | sum: 756.454 ms |
cyclictest:(5) | 13345.073 ms | 715171 | avg: 0.271 ms | max: 19.277 ms | sum:193861.885 ms |
Media:30834 | 12627.366 ms | 64061 | avg: 0.339 ms | max: 29.811 ms | sum:21687.561 ms |
ThreadPoolSingl:(6) | 9254.163 ms | 43524 | avg: 0.405 ms | max: 21.251 ms | sum:17617.750 ms |
V4L2DecoderThre:30887 | 9251.235 ms | 63002 | avg: 0.302 ms | max: 16.819 ms | sum:19000.280 ms |
VideoFrameCompo:30836 | 7947.943 ms | 47459 | avg: 0.300 ms | max: 19.666 ms | sum:14254.378 ms |
pulseaudio:1172 | 6638.018 ms | 43467 | avg: 0.219 ms | max: 14.621 ms | sum: 9535.951 ms |
threaded-ml:30883 | 5358.744 ms | 29193 | avg: 0.349 ms | max: 12.893 ms | sum:10175.289 ms |
----------------------------------------------------------------------------------------------------------
TOTAL: |1109784.680 ms | 3029507 | | 42.579 ms | 845307.017 ms |
----------------------------------------------------------------------------------------------------------
INFO: 0.001% context switch bugs (13 out of 1909170)

perf sched lat -i perf.data.full.6.6.0-rc2-v8.massive_intr-100ms-slice --sort=runtime -S 15 -T

----------------------------------------------------------------------------------------------------------
Task | Runtime ms | Switches | Avg delay ms | Max delay ms | Sum delay ms |
----------------------------------------------------------------------------------------------------------
massive_intr:(5) | 861639.948 ms | 812113 | avg: 0.277 ms | max: 138.492 ms | sum:224707.956 ms |
chromium-browse:(8) | 51307.141 ms | 125222 | avg: 0.530 ms | max: 42.935 ms | sum:66314.290 ms |
ThreadPoolForeg:(16) | 26421.979 ms | 45000 | avg: 0.873 ms | max: 35.752 ms | sum:39306.544 ms |
Chrome_ChildIOT:(5) | 22542.183 ms | 131172 | avg: 0.336 ms | max: 51.751 ms | sum:44091.954 ms |
mutter:1352 | 21835.334 ms | 48508 | avg: 0.566 ms | max: 37.495 ms | sum:27446.519 ms |
VizCompositorTh:39048 | 14531.018 ms | 63787 | avg: 0.463 ms | max: 56.326 ms | sum:29522.218 ms |
Xorg:925 | 14497.447 ms | 67315 | avg: 0.397 ms | max: 36.714 ms | sum:26735.175 ms |
alsa-sink-MAI P:1260 | 13935.472 ms | 33111 | avg: 0.020 ms | max: 6.753 ms | sum: 677.888 ms |
cyclictest:(5) | 12696.835 ms | 653111 | avg: 0.425 ms | max: 38.092 ms | sum:277440.622 ms |
Media:39089 | 12571.118 ms | 67187 | avg: 0.335 ms | max: 26.660 ms | sum:22498.438 ms |
V4L2DecoderThre:39125 | 9156.299 ms | 66378 | avg: 0.301 ms | max: 23.828 ms | sum:19991.504 ms |
ThreadPoolSingl:(4) | 9079.291 ms | 46187 | avg: 0.377 ms | max: 28.850 ms | sum:17422.535 ms |
VideoFrameCompo:39091 | 8103.756 ms | 50025 | avg: 0.290 ms | max: 33.230 ms | sum:14518.688 ms |
pulseaudio:1172 | 6575.897 ms | 44952 | avg: 0.259 ms | max: 19.937 ms | sum:11630.128 ms |
threaded-ml:39123 | 5367.921 ms | 29503 | avg: 0.339 ms | max: 24.648 ms | sum: 9993.313 ms |
----------------------------------------------------------------------------------------------------------
TOTAL: |1127978.646 ms | 2802143 | | 138.492 ms | 920177.765 ms |
----------------------------------------------------------------------------------------------------------
INFO: 0.000% context switch bugs (8 out of 1773282)

perf sched lat -i perf.data.full.6.6.0-rc2-v8.massive_intr-100ms-slice-fudge --sort=runtime -S 15 -T

----------------------------------------------------------------------------------------------------------
Task | Runtime ms | Switches | Avg delay ms | Max delay ms | Sum delay ms |
----------------------------------------------------------------------------------------------------------
massive_intr:(5) | 855426.748 ms | 777984 | avg: 0.266 ms | max: 111.756 ms | sum:206723.857 ms |
chromium-browse:(8) | 51656.331 ms | 122631 | avg: 0.531 ms | max: 36.305 ms | sum:65094.280 ms |
ThreadPoolForeg:(16) | 27473.845 ms | 43053 | avg: 0.972 ms | max: 34.973 ms | sum:41833.237 ms |
mutter:1352 | 21412.313 ms | 47476 | avg: 0.541 ms | max: 33.553 ms | sum:25685.892 ms |
Chrome_ChildIOT:(5) | 20283.623 ms | 119424 | avg: 0.395 ms | max: 31.266 ms | sum:47164.449 ms |
VizCompositorTh:36026 | 14643.428 ms | 63979 | avg: 0.464 ms | max: 34.832 ms | sum:29708.794 ms |
Xorg:925 | 14296.586 ms | 67756 | avg: 0.410 ms | max: 23.774 ms | sum:27811.107 ms |
alsa-sink-MAI P:1260 | 13977.823 ms | 33116 | avg: 0.023 ms | max: 5.513 ms | sum: 750.004 ms |
cyclictest:(5) | 12365.030 ms | 645151 | avg: 0.475 ms | max: 35.084 ms | sum:306236.764 ms |
Media:36076 | 12256.872 ms | 60848 | avg: 0.378 ms | max: 26.714 ms | sum:22978.110 ms |
ThreadPoolSingl:(4) | 8983.939 ms | 43538 | avg: 0.401 ms | max: 21.137 ms | sum:17468.417 ms |
V4L2DecoderThre:36101 | 8910.124 ms | 58505 | avg: 0.316 ms | max: 22.654 ms | sum:18503.766 ms |
VideoFrameCompo:36081 | 7851.251 ms | 44655 | avg: 0.325 ms | max: 24.662 ms | sum:14532.619 ms |
pulseaudio:1172 | 6671.226 ms | 44172 | avg: 0.332 ms | max: 19.466 ms | sum:14673.304 ms |
threaded-ml:36099 | 5338.302 ms | 28879 | avg: 0.379 ms | max: 20.754 ms | sum:10944.051 ms |
----------------------------------------------------------------------------------------------------------
TOTAL: |1115349.047 ms | 2703902 | | 111.756 ms | 959507.323 ms |
----------------------------------------------------------------------------------------------------------
INFO: 0.000% context switch bugs (4 out of 1763882)

perf sched lat -i perf.data.full.6.6.0-rc2-v8.cyclictest-500us-slice --sort=runtime -S 15 -T

----------------------------------------------------------------------------------------------------------
Task | Runtime ms | Switches | Avg delay ms | Max delay ms | Sum delay ms |
----------------------------------------------------------------------------------------------------------
massive_intr:(5) | 847672.515 ms | 951686 | avg: 0.233 ms | max: 30.660 ms | sum:222000.736 ms |
chromium-browse:(8) | 53628.292 ms | 133414 | avg: 0.456 ms | max: 30.414 ms | sum:60792.606 ms |
ThreadPoolForeg:(17) | 26765.405 ms | 43894 | avg: 0.869 ms | max: 43.268 ms | sum:38131.378 ms |
mutter:1352 | 22161.746 ms | 49865 | avg: 0.514 ms | max: 24.773 ms | sum:25639.654 ms |
Chrome_ChildIOT:(7) | 20878.044 ms | 122989 | avg: 0.315 ms | max: 44.034 ms | sum:38735.550 ms |
Xorg:925 | 14647.685 ms | 66766 | avg: 0.326 ms | max: 16.256 ms | sum:21785.616 ms |
VizCompositorTh:34571 | 14304.582 ms | 64222 | avg: 0.401 ms | max: 23.334 ms | sum:25767.033 ms |
alsa-sink-MAI P:1260 | 14006.042 ms | 33136 | avg: 0.022 ms | max: 4.851 ms | sum: 724.564 ms |
cyclictest:(5) | 13404.228 ms | 731243 | avg: 0.217 ms | max: 28.997 ms | sum:158740.145 ms |
Media:34626 | 12790.442 ms | 63825 | avg: 0.327 ms | max: 30.721 ms | sum:20853.022 ms |
V4L2DecoderThre:34651 | 9278.538 ms | 62619 | avg: 0.284 ms | max: 18.678 ms | sum:17761.070 ms |
ThreadPoolSingl:(4) | 9226.802 ms | 42846 | avg: 0.389 ms | max: 18.563 ms | sum:16684.938 ms |
VideoFrameCompo:34627 | 7788.047 ms | 46681 | avg: 0.285 ms | max: 14.102 ms | sum:13327.258 ms |
pulseaudio:1172 | 6643.612 ms | 42393 | avg: 0.186 ms | max: 7.567 ms | sum: 7873.616 ms |
threaded-ml:34649 | 5402.911 ms | 28737 | avg: 0.313 ms | max: 13.276 ms | sum: 9007.151 ms |
----------------------------------------------------------------------------------------------------------
TOTAL: |1114449.900 ms | 3009626 | | 44.034 ms | 740410.836 ms |
----------------------------------------------------------------------------------------------------------
INFO: 0.000% context switch bugs (8 out of 1873597)

perf sched lat -i perf.data.full.6.6.0-rc2-v8.cyclictest-500us-slice-fudge --sort=runtime -S 15 -T

----------------------------------------------------------------------------------------------------------
Task | Runtime ms | Switches | Avg delay ms | Max delay ms | Sum delay ms |
----------------------------------------------------------------------------------------------------------
massive_intr:(5) | 838981.090 ms | 880877 | avg: 0.262 ms | max: 28.559 ms | sum:230982.334 ms |
chromium-browse:(9) | 55497.942 ms | 128916 | avg: 0.520 ms | max: 39.331 ms | sum:67042.536 ms |
ThreadPoolForeg:(20) | 27669.981 ms | 41272 | avg: 0.990 ms | max: 29.775 ms | sum:40857.370 ms |
mutter:1352 | 22455.332 ms | 47362 | avg: 0.562 ms | max: 19.838 ms | sum:26611.559 ms |
Chrome_ChildIOT:(7) | 22295.707 ms | 123559 | avg: 0.344 ms | max: 29.449 ms | sum:42479.505 ms |
Xorg:925 | 14894.399 ms | 65930 | avg: 0.353 ms | max: 17.650 ms | sum:23303.557 ms |
VizCompositorTh:37170 | 14567.478 ms | 62477 | avg: 0.438 ms | max: 25.008 ms | sum:27366.073 ms |
alsa-sink-MAI P:1260 | 14207.866 ms | 33134 | avg: 0.022 ms | max: 4.092 ms | sum: 744.586 ms |
cyclictest:(5) | 13483.280 ms | 697504 | avg: 0.375 ms | max: 20.859 ms | sum:261834.795 ms |
Media:37224 | 12890.016 ms | 62641 | avg: 0.343 ms | max: 14.333 ms | sum:21510.222 ms |
ThreadPoolSingl:(4) | 9095.635 ms | 42121 | avg: 0.383 ms | max: 17.168 ms | sum:16116.408 ms |
V4L2DecoderThre:37261 | 9079.051 ms | 63220 | avg: 0.291 ms | max: 18.144 ms | sum:18421.175 ms |
VideoFrameCompo:37226 | 8049.344 ms | 46145 | avg: 0.309 ms | max: 18.575 ms | sum:14252.577 ms |
pulseaudio:1172 | 6767.263 ms | 42736 | avg: 0.205 ms | max: 8.202 ms | sum: 8781.668 ms |
threaded-ml:37262 | 5404.490 ms | 28428 | avg: 0.332 ms | max: 17.758 ms | sum: 9424.566 ms |
----------------------------------------------------------------------------------------------------------
TOTAL: |1115092.327 ms | 2896377 | | 39.331 ms | 881443.337 ms |
----------------------------------------------------------------------------------------------------------
INFO: 0.001% context switch bugs (12 out of 1845612)


Peeks out window at lovely sunny Sunday, and <poof>

2023-12-10 23:20:27

by Qais Yousef

Subject: Re: [PATCH 2/2] sched/eevdf: Use sched_attr::sched_runtime to set request/slice suggestion

On 09/20/23 00:37, Peter Zijlstra wrote:
> On Tue, Sep 19, 2023 at 11:07:08PM +0100, Qais Yousef wrote:
> > On 09/15/23 14:43, [email protected] wrote:
> > > Allow applications to directly set a suggested request/slice length using
> > > sched_attr::sched_runtime.
> >
> > I'm probably as eternally confused as ever, but is this going to be the latency
> > hint too? I find it hard to correlate runtime to latency if it is.
>
> Yes. Think of it as if a task has to save up for its slice. A shorter
> slice means a shorter time to save up, which means it can run
> sooner. A longer slice, you get to save up longer.
>
> Some people really want longer slices to reduce cache thrashing or
> held-lock-preemption like things. Oracle, Facebook, or virt thingies.
>
> Other people just want very short activations but want them quickly.

I did check with several folks around here in the Android world, and none of
us can see how we can use this interface in practice.

It is helpful for those who have a specific system and workload they want to
tune together. But as a generic app-developer interface it will be impossible
to use.

Is that sched-qos thingy worth trying to pursue as an alternative for app
developers? I think from their perspective all they can practically say is
whether they care about running ASAP or not; so a boolean flag to signal the
desire for short wake-up latency. How to implement that would be my pain. But
do you see an issue in principle with trying to go down that route and seeing
how far I (we, if anyone else is interested) can get?

I think the two can co-exist each serving a different purpose.

Or is there something about this interface that makes it usable in this
manner that I couldn't get?


Thanks!

--
Qais Yousef