2023-05-31 12:57:07

by Peter Zijlstra

Subject: [PATCH 00/15] sched: EEVDF and latency-nice and/or slice-attr

Hi!

Latest version of the EEVDF [1] patches.

The only real change since last time is the fix for tick-preemption [2], and a
simple safe-guard for the mixed slice heuristic.

Other than that, I've re-arranged the patches to make EEVDF come first and have
the latency-nice or slice-attribute patches on top.

Results should not be different from last time around, lots of people ran them
and found no major performance issues; what was found was better latency and
smaller variance (probably due to the more stable latency).

I'm hoping we can start queueing this part.

The big question is what additional interface to expose; some people have
voiced objections to the latency-nice interface, the 'obvious' alternative
is to directly expose the slice length as a request/hint.

The very last patch implements this alternative using sched_attr::sched_runtime
but is untested.
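For illustration only -- the exact semantics (and whether any sched_flags
bit is required) come from that untested patch, so take this as a sketch:
requesting a 3ms slice for the current task would look roughly like the
below, using the raw syscall since glibc has no sched_setattr() wrapper
and a local struct mirroring the uapi layout.

/* Sketch: request a 3ms slice via sched_attr::sched_runtime.
 * Semantics per the (untested) last patch in this series. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>

struct sched_attr {
	uint32_t size;
	uint32_t sched_policy;
	uint64_t sched_flags;
	int32_t  sched_nice;
	uint32_t sched_priority;
	/* SCHED_DEADLINE fields; sched_runtime doubles as the slice hint */
	uint64_t sched_runtime;
	uint64_t sched_deadline;
	uint64_t sched_period;
};

int main(void)
{
	struct sched_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.sched_policy = SCHED_OTHER;	/* stay in the fair class */
	attr.sched_runtime = 3 * 1000 * 1000;	/* 3ms, in nanoseconds */

	return syscall(SYS_sched_setattr, 0, &attr, 0);
}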

Diffstat for the base patches [1-11]:

include/linux/rbtree_augmented.h | 26 +
include/linux/sched.h | 7 +-
kernel/sched/core.c | 2 +
kernel/sched/debug.c | 48 +-
kernel/sched/fair.c | 1105 ++++++++++++++++++--------------------
kernel/sched/features.h | 24 +-
kernel/sched/sched.h | 16 +-
7 files changed, 587 insertions(+), 641 deletions(-)


[1] https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=805acf7726282721504c8f00575d91ebfd750564

[2] https://lkml.kernel.org/r/20230420150537.GC4253%40hirez.programming.kicks-ass.net



2023-05-31 20:10:28

by Qais Yousef

Subject: Re: [PATCH 00/15] sched: EEVDF and latency-nice and/or slice-attr

On 05/31/23 13:58, Peter Zijlstra wrote:
> Hi!
>
> Latest version of the EEVDF [1] patches.
>
> The only real change since last time is the fix for tick-preemption [2], and a
> simple safe-guard for the mixed slice heuristic.
>
> Other than that, I've re-arranged the patches to make EEVDF come first and have
> the latency-nice or slice-attribute patches on top.
>
> Results should not be different from last time around, lots of people ran them
> and found no major performance issues; what was found was better latency and
> smaller variance (probably due to the more stable latency).
>
> I'm hoping we can start queueing this part.
>
> The big question is what additional interface to expose; some people have
> voiced objections to the latency-nice interface, the 'obvious' alternative
> is to directly expose the slice length as a request/hint.

I haven't thought much about this, but..

There are two desired properties from user space that I hope we can achieve
with this new hint:

1. The obvious one: improve the wake-up latencies of some tasks.
2. Hint that there are some tasks that are happy to suffer high latencies.

2 is important because one thing user space wants to do, but we lack a
mechanism for, is to hint that some tasks are background tasks that can be
shuffled around and preempted more often, at the expense of keeping other
tasks happier.

I'm hoping this + uclamp_max would be a reasonable way to tag these
tasks now so that they consume less power and can be kept 'out of the way' when
necessary.

Slice length seems a bit counter-intuitive for these use cases. Maybe expose
the lag? Not sure if that is neutral enough, though, to allow moving away from
EEVDF in the future if we ever need to. A deadline probably sounds too promising.

Also, we had in mind to use these hints in EAS to avoid packing (though whether
this is still needed has to be re-evaluated), which is another source of
latency. I think Oracle wanted to use them to control the search depth at load
balance, IIRC - not sure if they still want that.

Not sure where we stand on having multiple users of this hint now. Should they
be new hints? If so, are we okay with continuing to extend sched_setattr(), or
would it be better to create a proper QoS framework?

As a side note - I started working on some patches to generalize the load
balancer misfit path. Issues we see:

1. Latency-sensitive tasks ending up on the same CPU can end up suffering. We
need both the wake-up path and the load balancer to be aware of this and help
spread them.
2. Busy uclamp_max tasks can end up stuck on the wrong core with no ability to
move them around. I have yet to see this in practice, but it's a concern for
wider deployment (which hasn't happened yet).

I think the misfit path is the right way to handle these cases. But I could be
wrong (or maybe you already take care of this and I just need to read the
patches more carefully). For the wake-up side of things I think we need to be
careful about cramming latency-sensitive tasks onto the same rq if we can avoid
it. I didn't spot that in Vincent's patches, nor in yours (which I have yet to
look at more closely).

If I may piggy-back a bit more: we seem to reserve a policy for SCHED_ISO,
which AFAICT had an old proposal to better tag interactive tasks. Is this
something waiting for someone to come forward with a proposal again? I have
to say, the idea of an interactive policy looks attractive. I sometimes wonder
if we need to go vertical (new policy/sched_class) instead of horizontal - or
maybe both.

Sorry, a bit of a brain dump here. HTH.


Cheers

--
Qais Yousef

>
> The very last patch implements this alternative using sched_attr::sched_runtime
> but is untested.
>
> Diffstat for the base patches [1-11]:
>
> include/linux/rbtree_augmented.h | 26 +
> include/linux/sched.h | 7 +-
> kernel/sched/core.c | 2 +
> kernel/sched/debug.c | 48 +-
> kernel/sched/fair.c | 1105 ++++++++++++++++++--------------------
> kernel/sched/features.h | 24 +-
> kernel/sched/sched.h | 16 +-
> 7 files changed, 587 insertions(+), 641 deletions(-)
>
>
> [1] https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=805acf7726282721504c8f00575d91ebfd750564
>
> [2] https://lkml.kernel.org/r/20230420150537.GC4253%40hirez.programming.kicks-ass.net
>

2023-09-29 16:54:55

by Youssef Esmat

Subject: Re: [PATCH 00/15] sched: EEVDF and latency-nice and/or slice-attr

>
> EEVDF fundamentally supports per task request/slice sizes, which is the
> primary motivator for finally finishing these patches.
>
> So the plan is to extend sched_setattr() to allow tasks setting their
> own ideal slice length. But we're not quite there yet.
>
> Having just returned from PTO the mailbox is an utter trainwreck, but
> I'll try and refresh those few patches this week for consideration.
>
> In the meantime I think you found the right knob to twiddle.

Hello Peter,

I am trying to understand a little better the need for the eligibility
checks (entity_eligible). I understand the general concept, but I am
trying to find a scenario where it is necessary. And maybe propose to
have it toggled by a feature flag.
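(My rough understanding of the check, as a conceptual sketch -- the names
below are illustrative, not the actual kernel code: an entity is "eligible"
when its vruntime has not run ahead of the weighted average vruntime V,
i.e. its lag is non-negative, and the pick takes the earliest virtual
deadline among eligible entities. Dropping the test picks the earliest
deadline overall.)

struct entity {
	unsigned long weight;
	unsigned long vruntime;
	unsigned long vdeadline;
};

static struct entity *pick(struct entity *es, int nr, int check_eligible)
{
	unsigned long W = 0, WV = 0;
	struct entity *best = NULL;
	int i;

	for (i = 0; i < nr; i++) {
		WV += es[i].weight * es[i].vruntime;	/* sum(w_j * v_j) */
		W  += es[i].weight;
	}

	for (i = 0; i < nr; i++) {
		/* eligible: v_i <= V, where V = WV/W; compare without dividing */
		if (check_eligible && W * es[i].vruntime > WV)
			continue;
		/* among the considered entities, take the earliest virtual deadline */
		if (!best || es[i].vdeadline < best->vdeadline)
			best = &es[i];
	}
	return best;
}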

Some of my testing:

All my testing was done on a two-core Celeron N400 CPU system at 1.1GHz,
running the 6.5-rc3 kernel with the EEVDF changes ported.

I have two CPU-bound tasks, one with a nice of -4 and the other with a
nice of 0. They are both affinitized to CPU 0. (while 1 { i++ })
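(Each task is essentially a pinned busy loop; a minimal sketch of what I
ran, with the nice value -- -4 or 0 -- passed on the command line:)

/* Sketch of one CPU-bound loop task: pin to CPU 0, set the nice value,
 * then spin forever. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdlib.h>
#include <sys/resource.h>

int main(int argc, char **argv)
{
	int nice_val = argc > 1 ? atoi(argv[1]) : 0;
	volatile unsigned long i = 0;
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(0, &set);				/* affinitize to CPU 0 */
	sched_setaffinity(0, sizeof(set), &set);
	setpriority(PRIO_PROCESS, 0, nice_val);		/* nice -4 or 0 */

	for (;;)					/* while 1 { i++ } */
		i++;
}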

With entity_eligible *enabled* and with entity_eligible *disabled*
(always returns 1):
Top showed consistent results, one at ~70% and the other at ~30%

So it seems the deadline adjustment will naturally achieve fairness.
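(That split is roughly what the weights alone predict: assuming the standard
nice-to-weight table, nice -4 maps to a weight of ~2501 and nice 0 to 1024,
so the expected shares are 2501/3525 ~= 71% and 1024/3525 ~= 29%, matching
the ~70%/~30% above.)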

I also added a few trace_printks to see if there is a case where
entity_eligible would have returned 0 before the deadline forced us to
reschedule. There were a few such cases. The following snippet of
prints shows that an entity became ineligible 2 slices before its
deadline expired. It seems like this will add more context switching
but still achieve a similar result at the end.

bprint: pick_eevdf: eligibility check: tid=4568,
eligible=0, deadline=26577257249, vruntime=26575761118
bprint: pick_eevdf: found best deadline: tid=4573,
deadline=26575451399, vruntime=26574838855
sched_switch: prev_comm=loop prev_pid=4568 prev_prio=120
prev_state=R ==> next_comm=loop next_pid=4573 next_prio=116
bputs: task_tick_fair: tick
bputs: task_tick_fair: tick
bprint: pick_eevdf: eligibility check: tid=4573,
eligible=1, deadline=26576270304, vruntime=26575657159
bprint: pick_eevdf: found best deadline: tid=4573,
deadline=26576270304, vruntime=26575657159
bputs: task_tick_fair: tick
bputs: task_tick_fair: tick
bprint: pick_eevdf: eligibility check: tid=4573,
eligible=0, deadline=26577089170, vruntime=26576476006
bprint: pick_eevdf: found best deadline: tid=4573,
deadline=26577089170, vruntime=26576476006
bputs: task_tick_fair: tick
bputs: task_tick_fair: tick
bprint: pick_eevdf: eligibility check: tid=4573,
eligible=0, deadline=26577908042, vruntime=26577294838
bprint: pick_eevdf: found best deadline: tid=4568,
deadline=26577257249, vruntime=26575761118
sched_switch: prev_comm=loop prev_pid=4573 prev_prio=116
prev_state=R ==> next_comm=loop next_pid=4568 next_prio=120

In a more practical example, I tried this with one of our benchmarks
that involves running Meet and Docs side by side and measuring the
input latency in the Docs document. The following is the average
latency for 5 runs:

(These numbers are after removing our cgroup hierarchy - that might be
a discussion for a later time).

CFS: 168ms
EEVDF with eligibility: 206ms (regression from CFS)
EEVDF *without* eligibility: 143ms (improvement to CFS)
EEVDF *without* eligibility and with a 6ms base_slice_ns (was 1.5ms):
104ms (great improvement)

Removing the eligibility check for this workload seemed to result in a
great improvement. I haven't dug deeper but I suspect it's related to
reduced context switches on our 2 core system.
As an extra test I also increased the base_slice_ns and it further
improved the input latency significantly.

I would love to hear your thoughts. Thanks!

2023-10-02 16:09:02

by Youssef Esmat

Subject: Re: [PATCH 00/15] sched: EEVDF and latency-nice and/or slice-attr

On Fri, Sep 29, 2023 at 11:54 AM Youssef Esmat
<[email protected]> wrote:
>
> >
> > EEVDF fundamentally supports per task request/slice sizes, which is the
> > primary motivator for finally finishing these patches.
> >
> > So the plan is to extend sched_setattr() to allow tasks setting their
> > own ideal slice length. But we're not quite there yet.
> >
> > Having just returned from PTO the mailbox is an utter trainwreck, but
> > I'll try and refresh those few patches this week for consideration.
> >
> > In the meantime I think you found the right knob to twiddle.
>
> Hello Peter,
>
> I am trying to understand a little better the need for the eligibility
> checks (entity_eligible). I understand the general concept, but I am
> trying to find a scenario where it is necessary. And maybe propose to
> have it toggled by a feature flag.
>
> Some of my testing:
>
> All my testing was done on a two core Celeron N400 cpu system 1.1Ghz.
> It was done on the 6.5-rc3 kernel with EEVDF changes ported.
>
> I have two CPU bound tasks one with a nice of -4 and the other with a
> nice of 0. They are both affinitized to CPU 0. (while 1 { i++ })
>
> With entity_eligible *enabled* and with entity_eligible *disabled*
> (always returns 1):
> Top showed consistent results, one at ~70% and the other at ~30%
>
> So it seems the deadline adjustment will naturally achieve fairness.
>
> I also added a few trace_printks to see if there is a case where
> entity_eligible would have returned 0 before the deadline forced us to
> reschedule. There were a few such cases. The following snippet of
> prints shows that an entity became ineligible 2 slices before its
> deadline expired. It seems like this will add more context switching
> but still achieve a similar result at the end.
>
> bprint: pick_eevdf: eligibility check: tid=4568,
> eligible=0, deadline=26577257249, vruntime=26575761118
> bprint: pick_eevdf: found best deadline: tid=4573,
> deadline=26575451399, vruntime=26574838855
> sched_switch: prev_comm=loop prev_pid=4568 prev_prio=120
> prev_state=R ==> next_comm=loop next_pid=4573 next_prio=116
> bputs: task_tick_fair: tick
> bputs: task_tick_fair: tick
> bprint: pick_eevdf: eligibility check: tid=4573,
> eligible=1, deadline=26576270304, vruntime=26575657159
> bprint: pick_eevdf: found best deadline: tid=4573,
> deadline=26576270304, vruntime=26575657159
> bputs: task_tick_fair: tick
> bputs: task_tick_fair: tick
> bprint: pick_eevdf: eligibility check: tid=4573,
> eligible=0, deadline=26577089170, vruntime=26576476006
> bprint: pick_eevdf: found best deadline: tid=4573,
> deadline=26577089170, vruntime=26576476006
> bputs: task_tick_fair: tick
> bputs: task_tick_fair: tick
> bprint: pick_eevdf: eligibility check: tid=4573,
> eligible=0, deadline=26577908042, vruntime=26577294838
> bprint: pick_eevdf: found best deadline: tid=4568,
> deadline=26577257249, vruntime=26575761118
> sched_switch: prev_comm=loop prev_pid=4573 prev_prio=116
> prev_state=R ==> next_comm=loop next_pid=4568 next_prio=120
>
> In a more practical example, I tried this with one of our benchmarks
> that involves running Meet and Docs side by side and measuring the
> input latency in the Docs document. The following is the average
> latency for 5 runs:
>
> (These numbers are after removing our cgroup hierarchy - that might be
> a discussion for a later time).
>
> CFS: 168ms
> EEVDF with eligibility: 206ms (regression from CFS)
> EEVDF *without* eligibility: 143ms (improvement to CFS)
> EEVDF *without* eligibility and with a 6ms base_slice_ns (was 1.5ms):
> 104ms (great improvement)
>
> Removing the eligibility check for this workload seemed to result in a
> great improvement. I haven't dug deeper but I suspect it's related to
> reduced context switches on our 2 core system.
> As an extra test I also increased the base_slice_ns and it further
> improved the input latency significantly.
>
> I would love to hear your thoughts. Thanks!

For completeness I ran two more tests:

1. EEVDF with eligibility and 6ms base_slice_ns.
2. EEVDF with eligibility with Benjamin Segall's patch
(https://lore.kernel.org/all/[email protected]/).

I copied over all the previous results for easier comparison.

CFS: 168ms
EEVDF, eligib, 1.5ms slice: 206ms
EEVDF, eligib, 6ms slice: 167ms
EEVDF_Fix, eligib, 1.5ms slice: 190ms
EEVDF, *no*eligib, 1.5ms slice: 143ms
EEVDF, *no*eligib, 6ms slice: 104ms

It does seem like Benjamin's fix did bring an improvement.

2023-10-02 22:50:41

by Peter Zijlstra

Subject: Re: [PATCH 00/15] sched: EEVDF and latency-nice and/or slice-attr

On Fri, Sep 29, 2023 at 11:54:25AM -0500, Youssef Esmat wrote:
> >
> > EEVDF fundamentally supports per task request/slice sizes, which is the
> > primary motivator for finally finishing these patches.
> >
> > So the plan is to extend sched_setattr() to allow tasks setting their
> > own ideal slice length. But we're not quite there yet.
> >
> > Having just returned from PTO the mailbox is an utter trainwreck, but
> > I'll try and refresh those few patches this week for consideration.
> >
> > In the meantime I think you found the right knob to twiddle.
>
> Hello Peter,
>
> I am trying to understand a little better the need for the eligibility
> checks (entity_eligible). I understand the general concept, but I am
> trying to find a scenario where it is necessary. And maybe propose to
> have it toggled by a feature flag.

My initial response was that it ensures fairness, but thinking a little
about it I'm not so sure anymore.

I do think it features in section 6 lemma 4 and 5, which provide
interference bounds. But I'd have to think a little more about it.

In the current setup, where all requests are of equal size, the virtual
deadline tree is basically identical to the virtual runtime tree, just
transposed (in the static-state scenario).
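(A worked note on why: for a common request size r, d_i = v_i + r/w_i; with
equal weights as well, every deadline is just the vruntime plus the same
constant, so ordering by deadline and ordering by vruntime coincide.)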

When mixing request sizes things become a little more interesting.

Let me ponder this a little bit more.

2023-10-05 14:34:34

by Peter Zijlstra

Subject: Re: [PATCH 00/15] sched: EEVDF and latency-nice and/or slice-attr

On Mon, Oct 02, 2023 at 08:41:36PM +0200, Peter Zijlstra wrote:

> When mixing request sizes things become a little more interesting.
>
> Let me ponder this a little bit more.

Using the attached program (I got *REALLY* fed up trying to draw these
diagrams by hand), let us illustrate the difference between Earliest
*Eligible* Virtual Deadline First and the one with the Eligible test
taken out: EVDF.

Specifically, the program was used with the following argument for
EEVDF:

./eevdf -e "0,1024,6" -e "1,1024,2" -e "2,1024,18" -v 19

and with an additional '-n' for the EVDF column.


EEVDF EVDF


d = 6 d = 6
d = 2 d = 2
d = 18 d = 18
q = 2 q = 2

t=0 V=1 t=0 V=1
A |----< A |----<
>B |< >B |<
C |----------------< C |----------------<
|*--------|---------|---------|---------|---- |*--------|---------|---------|---------|----


t=2 V=1 t=2 V=1
>A |----< A |----<
B |< >B |<
C |----------------< C |----------------<
|*--------|---------|---------|---------|---- |*--------|---------|---------|---------|----


t=8 V=3 t=4 V=2
A |----< >A |----<
>B |< B |<
C |----------------< C |----------------<
|--*------|---------|---------|---------|---- |-*-------|---------|---------|---------|----


t=10 V=4 t=10 V=4
A |----< A |----<
B |< >B |<
>C |----------------< C |----------------<
|---*-----|---------|---------|---------|---- |---*-----|---------|---------|---------|----


t=28 V=10 t=12 V=5
A |----< A |----<
>B |< >B |<
C |----------------< C |----------------<
|---------*---------|---------|---------|---- |----*----|---------|---------|---------|----


t=30 V=11 t=14 V=5
A |----< A |----<
>B |< >B |<
C |----------------< C |----------------<
|---------|*--------|---------|---------|---- |----*----|---------|---------|---------|----


t=32 V=11 t=16 V=6
A |----< >A |----<
>B |< B |<
C |----------------< C |----------------<
|---------|*--------|---------|---------|---- |-----*---|---------|---------|---------|----


t=34 V=12 t=22 V=8
>A |----< A |----<
B |< >B |<
C |----------------< C |----------------<
|---------|-*-------|---------|---------|---- |-------*-|---------|---------|---------|----


t=40 V=14 t=24 V=9
A |----< A |----<
>B |< >B |<
C |----------------< C |----------------<
|---------|---*-----|---------|---------|---- |--------*|---------|---------|---------|----


t=42 V=15 t=26 V=9
A |----< A |----<
>B |< >B |<
C |----------------< C |----------------<
|---------|----*----|---------|---------|---- |--------*|---------|---------|---------|----


t=44 V=15 t=28 V=10
A |----< >A |----<
>B |< B |<
C |----------------< C |----------------<
|---------|----*----|---------|---------|---- |---------*---------|---------|---------|----


t=46 V=16 t=34 V=12
>A |----< A |----<
B |< >B |<
C |----------------< C |----------------<
|---------|-----*---|---------|---------|---- |---------|-*-------|---------|---------|----


t=52 V=18 t=36 V=13
A |----< A |----<
>B |< B |<
C |----------------< >C |----------------<
|---------|-------*-|---------|---------|---- |---------|--*------|---------|---------|----


t=54 V=19 t=54 V=19
A |----< A |----<
>B |< >B |<
C |----------------< C |----------------<
|---------|--------*|---------|---------|---- |---------|--------*|---------|---------|----


lags: -10, 6 lags: -7, 11

BAaaBCccccccccBBBAaaBBBAaaBB BBAaaBBBAaaBBBAaaBCccccccccB



As I wrote before, EVDF has worse lag bounds, but this is not
insurmountable. The biggest problem that I can see is that of wakeup
preemption. Currently we allow preemption when 'current' has reached V
(RUN_TO_PARITY in pick_eevdf()).

With these rules, when EEVDF schedules C (our large slice task) at t=10
above, C is only a little behind V and can be readily preempted after
about 2 time units.

However, EVDF will delay scheduling C until much later; see how A and B
walk far ahead of V until t=36. Only then will we pick C. But this means
that we're firmly stuck with C for at least 11 time units. A newly
placed task will be around V and will have no chance to preempt.

That said, I do have me a patch to cure some of that:

https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/commit/?h=sched/eevdf&id=d7edbe431f31762e516f2730196f41322edcc621

That would allow a task with a shorter request time to preempt in spite
of RUN_TO_PARITY.

However, in this example V is only 2/3 of the way to C's deadline; if
we were to have many more tasks, you'd see V get closer and closer to
C's deadline, and it would become harder and harder to place a new task
such that preemption becomes viable.

Adding 4 more tasks:

./eevdf -e "0,1024,6" -e "1,1024,2" -e "2,1024,18" -v 19 -n -e "3,1024,2" -e "4,1024,2" -e "5,1024,2" -e "6,1024,2"

t=92 V=16
A |----<
B |<
>C |----------------<
D |<
E |<
F |<
G |<
|---------|-----*---|---------|---------|----


And I worry this will create very real latency spikes.

That said, I do see that not having the eligibility check can help. So
I'm not opposed to having a sched_feat for this, but I would not want
to default to EVDF.


Attachments:
eevdf.c (4.16 kB)

2023-10-05 14:47:52

by Peter Zijlstra

Subject: Re: [PATCH 00/15] sched: EEVDF and latency-nice and/or slice-attr

On Thu, Oct 05, 2023 at 02:05:57PM +0200, Peter Zijlstra wrote:

> Using the attached program (I got *REALLY* fed up trying to draw these
> diagrams by hand), let us illustrate the difference between Earliest
> *Eligible* Virtual Deadline First and the one with the Eligible test
> taken out: EVDF.
>
> Specifically, the program was used with the following argument for
> EEVDF:
>
> ./eevdf -e "0,1024,6" -e "1,1024,2" -e "2,1024,18" -v 19
>
> and with an additional '-n' for the EVDF column.
>

<snip a metric ton of diagrams>

>
> As I wrote before; EVDF has worse lag bounds, but this is not
> insurmountable. The biggest problem that I can see is that of wakeup
> preemption. Currently we allow to preempt when 'current' has reached V
> (RUN_TO_PARITY in pick_eevdf()).
>
> With these rules, when EEVDF schedules C (our large slice task) at t=10
> above, it is only a little behind C and can be reaily preempted after
> about 2 time units.
>
> However, EVDF will delay scheduling C until much later, see how A and B
> walk far ahead of V until t=36. Only when will we pick C. But this means
> that we're firmly stuck with C for at least 11 time units. A newly
> placed task will be around V and will have no chance to preempt.
>
> That said, I do have me a patch to cure some of that:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/commit/?h=sched/eevdf&id=d7edbe431f31762e516f2730196f41322edcc621
>
> That would allow a task with a shorter request time to preempt in spite
> of RUN_TO_PARITY.
>

So, doing all that gave me an idea, see queue/sched/eevdf BIAS_ELIGIBLE.

It pushes the eligibility threshold (V) right by one average request.
The below patch is against eevdf.c.

I've not yet run it through the normal set of hackbench, netperf,
etc., so it might eat your pet and set your granny on fire.


--- eevdf.c.orig 2023-10-05 16:11:35.821114320 +0200
+++ eevdf.c 2023-10-05 16:08:38.387080327 +0200
@@ -7,6 +7,7 @@
#include <sys/param.h>

bool eligible = true;
+bool bias = false;
unsigned long V_lim = 20;

struct entity {
@@ -79,16 +80,17 @@

struct entity *pick_entity(int nr, struct entity *es)
{
- unsigned long W = 0, V = 0;
+ unsigned long W = 0, V = 0, R = 0;
struct entity *e = NULL;

for (int i = 0; i < nr; i++) {
V += es[i].weight * es[i].vruntime;
+ R += es[i].request;
W += es[i].weight;
}

for (int i = 0; i < nr; i++) {
- if (eligible && W*es[i].vruntime > V)
+ if (eligible && W*es[i].vruntime > V + (bias * R))
continue;

if (!e || es[i].vdeadline < e->vdeadline)
@@ -169,10 +171,14 @@

const int N = sizeof(es) / sizeof(es[0]);

- while ((opt = getopt(argc, argv, "nv:e:")) != -1) {
+ while ((opt = getopt(argc, argv, "bnv:e:")) != -1) {
unsigned int v,w,r;

switch (opt) {
+ case 'b':
+ bias = true;
+ break;
+
case 'n':
eligible = false;
break;
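
(For reference, with the above applied the biased eligibility threshold is
selected by passing '-b'; e.g. re-running the earlier example as:

  ./eevdf -e "0,1024,6" -e "1,1024,2" -e "2,1024,18" -v 19 -b

output not reproduced here.)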

2023-10-05 15:16:37

by Peter Zijlstra

Subject: Re: [PATCH 00/15] sched: EEVDF and latency-nice and/or slice-attr

On Thu, Oct 05, 2023 at 04:14:08PM +0200, Peter Zijlstra wrote:

> --- eevdf.c.orig 2023-10-05 16:11:35.821114320 +0200
> +++ eevdf.c 2023-10-05 16:08:38.387080327 +0200
> @@ -7,6 +7,7 @@
> #include <sys/param.h>
>
> bool eligible = true;
> +bool bias = false;
> unsigned long V_lim = 20;
>
> struct entity {
> @@ -79,16 +80,17 @@
>
> struct entity *pick_entity(int nr, struct entity *es)
> {
> - unsigned long W = 0, V = 0;
> + unsigned long W = 0, V = 0, R = 0;
> struct entity *e = NULL;
>
> for (int i = 0; i < nr; i++) {
> V += es[i].weight * es[i].vruntime;
> + R += es[i].request;

* 1024

Also, average seems too much, one large value lifts it too easily.

Need to come up with something better :/

> W += es[i].weight;
> }
>
> for (int i = 0; i < nr; i++) {
> - if (eligible && W*es[i].vruntime > V)
> + if (eligible && W*es[i].vruntime > V + (bias * R))
> continue;
>
> if (!e || es[i].vdeadline < e->vdeadline)
> @@ -169,10 +171,14 @@
>
> const int N = sizeof(es) / sizeof(es[0]);
>
> - while ((opt = getopt(argc, argv, "nv:e:")) != -1) {
> + while ((opt = getopt(argc, argv, "bnv:e:")) != -1) {
> unsigned int v,w,r;
>
> switch (opt) {
> + case 'b':
> + bias = true;
> + break;
> +
> case 'n':
> eligible = false;
> break;

2023-10-05 18:23:42

by Youssef Esmat

Subject: Re: [PATCH 00/15] sched: EEVDF and latency-nice and/or slice-attr

On Thu, Oct 5, 2023 at 7:06 AM Peter Zijlstra <[email protected]> wrote:
>
> On Mon, Oct 02, 2023 at 08:41:36PM +0200, Peter Zijlstra wrote:
>
> > When mixing request sizes things become a little more interesting.
> >
> > Let me ponder this a little bit more.
>
> Using the attached program (I got *REALLY* fed up trying to draw these
> diagrams by hand), let us illustrate the difference between Earliest
> *Eligible* Virtual Deadline First and the one with the Eligible test
> taken out: EVDF.
>
> Specifically, the program was used with the following argument for
> EEVDF:
>
> ./eevdf -e "0,1024,6" -e "1,1024,2" -e "2,1024,18" -v 19
>
> and with an additional '-n' for the EVDF column.
>
>
> EEVDF EVDF
>
>
> d = 6 d = 6
> d = 2 d = 2
> d = 18 d = 18
> q = 2 q = 2
>
> t=0 V=1 t=0 V=1
> A |----< A |----<
> >B |< >B |<
> C |----------------< C |----------------<
> |*--------|---------|---------|---------|---- |*--------|---------|---------|---------|----
>
>
> t=2 V=1 t=2 V=1
> >A |----< A |----<
> B |< >B |<
> C |----------------< C |----------------<
> |*--------|---------|---------|---------|---- |*--------|---------|---------|---------|----
>
>
> t=8 V=3 t=4 V=2
> A |----< >A |----<
> >B |< B |<
> C |----------------< C |----------------<
> |--*------|---------|---------|---------|---- |-*-------|---------|---------|---------|----
>
>
> t=10 V=4 t=10 V=4
> A |----< A |----<
> B |< >B |<
> >C |----------------< C |----------------<
> |---*-----|---------|---------|---------|---- |---*-----|---------|---------|---------|----
>
>
> t=28 V=10 t=12 V=5
> A |----< A |----<
> >B |< >B |<
> C |----------------< C |----------------<
> |---------*---------|---------|---------|---- |----*----|---------|---------|---------|----
>
>
> t=30 V=11 t=14 V=5
> A |----< A |----<
> >B |< >B |<
> C |----------------< C |----------------<
> |---------|*--------|---------|---------|---- |----*----|---------|---------|---------|----
>
>
> t=32 V=11 t=16 V=6
> A |----< >A |----<
> >B |< B |<
> C |----------------< C |----------------<
> |---------|*--------|---------|---------|---- |-----*---|---------|---------|---------|----
>
>
> t=34 V=12 t=22 V=8
> >A |----< A |----<
> B |< >B |<
> C |----------------< C |----------------<
> |---------|-*-------|---------|---------|---- |-------*-|---------|---------|---------|----
>
>
> t=40 V=14 t=24 V=9
> A |----< A |----<
> >B |< >B |<
> C |----------------< C |----------------<
> |---------|---*-----|---------|---------|---- |--------*|---------|---------|---------|----
>
>
> t=42 V=15 t=26 V=9
> A |----< A |----<
> >B |< >B |<
> C |----------------< C |----------------<
> |---------|----*----|---------|---------|---- |--------*|---------|---------|---------|----
>
>
> t=44 V=15 t=28 V=10
> A |----< >A |----<
> >B |< B |<
> C |----------------< C |----------------<
> |---------|----*----|---------|---------|---- |---------*---------|---------|---------|----
>
>
> t=46 V=16 t=34 V=12
> >A |----< A |----<
> B |< >B |<
> C |----------------< C |----------------<
> |---------|-----*---|---------|---------|---- |---------|-*-------|---------|---------|----
>
>
> t=52 V=18 t=36 V=13
> A |----< A |----<
> >B |< B |<
> C |----------------< >C |----------------<
> |---------|-------*-|---------|---------|---- |---------|--*------|---------|---------|----
>
>
> t=54 V=19 t=54 V=19
> A |----< A |----<
> >B |< >B |<
> C |----------------< C |----------------<
> |---------|--------*|---------|---------|---- |---------|--------*|---------|---------|----
>
>
> lags: -10, 6 lags: -7, 11
>
> BAaaBCccccccccBBBAaaBBBAaaBB BBAaaBBBAaaBBBAaaBCccccccccB
>
>
>
> As I wrote before; EVDF has worse lag bounds, but this is not
> insurmountable. The biggest problem that I can see is that of wakeup
> preemption. Currently we allow to preempt when 'current' has reached V
> (RUN_TO_PARITY in pick_eevdf()).
>
> With these rules, when EEVDF schedules C (our large slice task) at t=10
> above, it is only a little behind C and can be reaily preempted after
> about 2 time units.
>
> However, EVDF will delay scheduling C until much later, see how A and B
> walk far ahead of V until t=36. Only when will we pick C. But this means
> that we're firmly stuck with C for at least 11 time units. A newly
> placed task will be around V and will have no chance to preempt.
>

Thank you for the detailed analysis! I am still in the process of
digesting everything.
I do have a quick question: this will only be the case if we adjust
C's runtime without adjusting its nice value, correct? So it does not
currently apply to the submitted code, where the only way to change the
deadline is to also change the nice value and thus how fast/slowly
vruntime accumulates. In other words, without the sched_runtime
patch [1] we should not run into this scenario, correct?

[1] https://lore.kernel.org/lkml/[email protected]/

> That said, I do have me a patch to cure some of that:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/commit/?h=sched/eevdf&id=d7edbe431f31762e516f2730196f41322edcc621
>
> That would allow a task with a shorter request time to preempt in spite
> of RUN_TO_PARITY.
>
> However, in this example V is only 2/3 of the way to C's deadline, but
> it we were to have many more tasks, you'll see V gets closer and closer
> to C's deadline and it will become harder and harder to place such that
> preemption becomes viable.
>
> Adding 4 more tasks:
>
> ./eevdf -e "0,1024,6" -e "1,1024,2" -e "2,1024,18" -v 19 -n -e "3,1024,2" -e "4,1024,2" -e "5,1024,2" -e "6,1024,2"
>
> t=92 V=16
> A |----<
> B |<
> >C |----------------<
> D |<
> E |<
> F |<
> G |<
> |---------|-----*---|---------|---------|----
>
>
> And I worry this will create very real latency spikes.
>
> That said; I do see not having the eligibility check can help. So I'm
> not opposed to having a sched_feat for this, but I would not want to
> default to EVDF.

2023-10-06 00:36:46

by Youssef Esmat

Subject: Re: [PATCH 00/15] sched: EEVDF and latency-nice and/or slice-attr

On Thu, Oct 5, 2023 at 1:23 PM Youssef Esmat <[email protected]> wrote:
>
> On Thu, Oct 5, 2023 at 7:06 AM Peter Zijlstra <[email protected]> wrote:
> >
> > On Mon, Oct 02, 2023 at 08:41:36PM +0200, Peter Zijlstra wrote:
> >
> > > When mixing request sizes things become a little more interesting.
> > >
> > > Let me ponder this a little bit more.
> >
> > Using the attached program (I got *REALLY* fed up trying to draw these
> > diagrams by hand), let us illustrate the difference between Earliest
> > *Eligible* Virtual Deadline First and the one with the Eligible test
> > taken out: EVDF.
> >
> > Specifically, the program was used with the following argument for
> > EEVDF:
> >
> > ./eevdf -e "0,1024,6" -e "1,1024,2" -e "2,1024,18" -v 19
> >
> > and with an additional '-n' for the EVDF column.
> >

<snip diagrams>

> >
> >
> > As I wrote before; EVDF has worse lag bounds, but this is not
> > insurmountable. The biggest problem that I can see is that of wakeup
> > preemption. Currently we allow to preempt when 'current' has reached V
> > (RUN_TO_PARITY in pick_eevdf()).
> >
> > With these rules, when EEVDF schedules C (our large slice task) at t=10
> > above, it is only a little behind C and can be reaily preempted after
> > about 2 time units.
> >
> > However, EVDF will delay scheduling C until much later, see how A and B
> > walk far ahead of V until t=36. Only when will we pick C. But this means
> > that we're firmly stuck with C for at least 11 time units. A newly
> > placed task will be around V and will have no chance to preempt.
> >
>
> Thank you for the detailed analysis! I am still in the process of
> digesting everything.
> I do have a quick question, this will only be the case if we adjust
> C's runtime without adjusting nice value, correct? So it does not
> currently apply to the submitted code where the only way to change the
> deadline is to also change the nice value and thus how fast/slow
> vruntime accumulates. In other words without the sched_runtime
> patch[1] we should not run into this scenario, correct?
>
> [1] https://lore.kernel.org/lkml/[email protected]/

Sorry, to clarify, by "this" I meant "that we're firmly stuck with C
for at least 11 time units".

>
> > That said, I do have me a patch to cure some of that:
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/commit/?h=sched/eevdf&id=d7edbe431f31762e516f2730196f41322edcc621
> >
> > That would allow a task with a shorter request time to preempt in spite
> > of RUN_TO_PARITY.
> >
> > However, in this example V is only 2/3 of the way to C's deadline, but
> > it we were to have many more tasks, you'll see V gets closer and closer
> > to C's deadline and it will become harder and harder to place such that
> > preemption becomes viable.
> >
> > Adding 4 more tasks:
> >
> > ./eevdf -e "0,1024,6" -e "1,1024,2" -e "2,1024,18" -v 19 -n -e "3,1024,2" -e "4,1024,2" -e "5,1024,2" -e "6,1024,2"
> >
> > t=92 V=16
> > A |----<
> > B |<
> > >C |----------------<
> > D |<
> > E |<
> > F |<
> > G |<
> > |---------|-----*---|---------|---------|----
> >
> >
> > And I worry this will create very real latency spikes.
> >
> > That said; I do see not having the eligibility check can help. So I'm
> > not opposed to having a sched_feat for this, but I would not want to
> > default to EVDF.

2023-10-07 22:05:41

by Peter Zijlstra

Subject: Re: [PATCH 00/15] sched: EEVDF and latency-nice and/or slice-attr

On Thu, Oct 05, 2023 at 02:05:57PM +0200, Peter Zijlstra wrote:

> t=10 V=4 t=10 V=4
> A |----< A |----<
> B |< >B |<
> >C |----------------< C |----------------<
> |---*-----|---------|---------|---------|---- |---*-----|---------|---------|---------|----
>

>
> t=52 V=18 t=36 V=13
> A |----< A |----<
> >B |< B |<
> C |----------------< >C |----------------<
> |---------|-------*-|---------|---------|---- |---------|--*------|---------|---------|----
>

>
> BAaaBCccccccccBBBAaaBBBAaaBB BBAaaBBBAaaBBBAaaBCccccccccB
>
>
>
> As I wrote before; EVDF has worse lag bounds, but this is not
> insurmountable. The biggest problem that I can see is that of wakeup
> preemption. Currently we allow to preempt when 'current' has reached V
> (RUN_TO_PARITY in pick_eevdf()).
>
> With these rules, when EEVDF schedules C (our large slice task) at t=10
> above, it is only a little behind C and can be reaily preempted after
> about 2 time units.
>
> However, EVDF will delay scheduling C until much later, see how A and B
> walk far ahead of V until t=36. Only when will we pick C. But this means
> that we're firmly stuck with C for at least 11 time units. A newly
> placed task will be around V and will have no chance to preempt.

Playing around with it a little:

EEVDF EVDF

slice 30000000 slice 30000000
# Min Latencies: 00014 # Min Latencies: 00048
# Avg Latencies: 00692 # Avg Latencies: 188239
# Max Latencies: 94633 # Max Latencies: 961241

slice 3000000 slice 3000000
# Min Latencies: 00054 # Min Latencies: 00055
# Avg Latencies: 00522 # Avg Latencies: 00673
# Max Latencies: 41475 # Max Latencies: 13297

slice 300000 slice 300000
# Min Latencies: 00018 # Min Latencies: 00024
# Avg Latencies: 00344 # Avg Latencies: 00056
# Max Latencies: 20061 # Max Latencies: 00860


So while it improves the short slices, it completely blows up the large
slices -- utterly slaughters the large slices in fact.

And all the many variants of BIAS_ELIGIBLE that I've tried so far only
manage to murder the high end while simultaneously not actually helping
the low end -- so that's a complete write off.


By far the sanest option so far is PLACE_SLEEPER -- and that is very
much not a nice option either :-(

2023-10-09 14:42:38

by Peter Zijlstra

Subject: Re: [PATCH 00/15] sched: EEVDF and latency-nice and/or slice-attr

On Sun, Oct 08, 2023 at 12:04:00AM +0200, Peter Zijlstra wrote:
> On Thu, Oct 05, 2023 at 02:05:57PM +0200, Peter Zijlstra wrote:
>
> > t=10 V=4 t=10 V=4
> > A |----< A |----<
> > B |< >B |<
> > >C |----------------< C |----------------<
> > |---*-----|---------|---------|---------|---- |---*-----|---------|---------|---------|----
> >
>
> >
> > t=52 V=18 t=36 V=13
> > A |----< A |----<
> > >B |< B |<
> > C |----------------< >C |----------------<
> > |---------|-------*-|---------|---------|---- |---------|--*------|---------|---------|----
> >
>
> >
> > BAaaBCccccccccBBBAaaBBBAaaBB BBAaaBBBAaaBBBAaaBCccccccccB
> >
> >
> >
> > As I wrote before; EVDF has worse lag bounds, but this is not
> > insurmountable. The biggest problem that I can see is that of wakeup
> > preemption. Currently we allow to preempt when 'current' has reached V
> > (RUN_TO_PARITY in pick_eevdf()).
> >
> > With these rules, when EEVDF schedules C (our large slice task) at t=10
> > above, it is only a little behind C and can be reaily preempted after
> > about 2 time units.
> >
> > However, EVDF will delay scheduling C until much later, see how A and B
> > walk far ahead of V until t=36. Only when will we pick C. But this means
> > that we're firmly stuck with C for at least 11 time units. A newly
> > placed task will be around V and will have no chance to preempt.
>
> Playing around with it a little:
>
> EEVDF EVDF
>
> slice 30000000 slice 30000000
> # Min Latencies: 00014 # Min Latencies: 00048
> # Avg Latencies: 00692 # Avg Latencies: 188239
> # Max Latencies: 94633 # Max Latencies: 961241
>
> slice 3000000 slice 3000000
> # Min Latencies: 00054 # Min Latencies: 00055
> # Avg Latencies: 00522 # Avg Latencies: 00673
> # Max Latencies: 41475 # Max Latencies: 13297
>
> slice 300000 slice 300000
> # Min Latencies: 00018 # Min Latencies: 00024
> # Avg Latencies: 00344 # Avg Latencies: 00056
> # Max Latencies: 20061 # Max Latencies: 00860
>
>
> So while it improves the short slices, it completely blows up the large
> slices -- utterly slaughters the large slices in fact.
>
> And all the many variants of BIAS_ELIGIBLE that I've tried so far only
> manage to murder the high end while simultaneously not actually helping
> the low end -- so that's a complete write off.
>
>
> By far the sanest option so far is PLACE_SLEEPER -- and that is very
> much not a nice option either :-(

And this can be easily explained by the fact that we insert tasks around
0-lag, so if we delay execution past this point we create an effective
DoS window.

2023-10-10 00:51:53

by Youssef Esmat

Subject: Re: [PATCH 00/15] sched: EEVDF and latency-nice and/or slice-attr

On Sat, Oct 7, 2023 at 5:04 PM Peter Zijlstra <[email protected]> wrote:
>
> On Thu, Oct 05, 2023 at 02:05:57PM +0200, Peter Zijlstra wrote:
>
> > t=10 V=4 t=10 V=4
> > A |----< A |----<
> > B |< >B |<
> > >C |----------------< C |----------------<
> > |---*-----|---------|---------|---------|---- |---*-----|---------|---------|---------|----
> >
>
> >
> > t=52 V=18 t=36 V=13
> > A |----< A |----<
> > >B |< B |<
> > C |----------------< >C |----------------<
> > |---------|-------*-|---------|---------|---- |---------|--*------|---------|---------|----
> >
>
> >
> > BAaaBCccccccccBBBAaaBBBAaaBB BBAaaBBBAaaBBBAaaBCccccccccB
> >
> >
> >
> > As I wrote before; EVDF has worse lag bounds, but this is not
> > insurmountable. The biggest problem that I can see is that of wakeup
> > preemption. Currently we allow to preempt when 'current' has reached V
> > (RUN_TO_PARITY in pick_eevdf()).
> >
> > With these rules, when EEVDF schedules C (our large slice task) at t=10
> > above, it is only a little behind C and can be reaily preempted after
> > about 2 time units.
> >
> > However, EVDF will delay scheduling C until much later, see how A and B
> > walk far ahead of V until t=36. Only when will we pick C. But this means
> > that we're firmly stuck with C for at least 11 time units. A newly
> > placed task will be around V and will have no chance to preempt.
>
> Playing around with it a little:
>
> EEVDF EVDF
>
> slice 30000000 slice 30000000
> # Min Latencies: 00014 # Min Latencies: 00048
> # Avg Latencies: 00692 # Avg Latencies: 188239
> # Max Latencies: 94633 # Max Latencies: 961241
>
> slice 3000000 slice 3000000
> # Min Latencies: 00054 # Min Latencies: 00055
> # Avg Latencies: 00522 # Avg Latencies: 00673
> # Max Latencies: 41475 # Max Latencies: 13297
>
> slice 300000 slice 300000
> # Min Latencies: 00018 # Min Latencies: 00024
> # Avg Latencies: 00344 # Avg Latencies: 00056
> # Max Latencies: 20061 # Max Latencies: 00860
>

Thanks for sharing. Which workload was used to generate these numbers?

I think looking at the sched latency numbers alone does not show the
complete picture. I ran the same input latency test again and tried to
capture some of these numbers for the chrome processes.

EEVDF 1.5ms slice:

Input latency test result: 226ms
perf sched latency:
switches: 1,084,694
avg: 1.139 ms
max: 408.397 ms

EEVDF 6.0ms slice:

Input latency test result: 178ms
perf sched latency:
switches: 892,306
avg: 1.145 ms
max: 354.344 ms

EVDF 6.0ms slice:

Input latency test result: 112ms
perf sched latency:
switches: 134,200
avg: 2.610 ms
max: 374.888 ms

EVDF 6.0ms slice
(no run_to_parity, no place_lag, no place_deadline_initial):

Input latency test result: 110ms
perf sched latency:
switches: 531,656
avg: 0.830 ms
max: 520.463 ms
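
(For reference, the perf numbers above are from perf sched; roughly
something like:

  perf sched record -- <run the input latency test>
  perf sched latency

which reports the per-task switch counts and avg/max scheduling delay
quoted above.)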

For our scenario, it is very expensive to interrupt UI threads. It
will increase the input latency significantly. Lowering the scheduling
latency at the cost of switching out important threads can be very
detrimental in this workload. UI and input threads run with a nice
value of -8.

This also seems to match Daniel's message earlier in this thread, where
using a 12ms base slice improved their benchmarks.

That said, this might not be beneficial for all workloads, and we are
still trying our other workloads out.

>
> So while it improves the short slices, it completely blows up the large
> slices -- utterly slaughters the large slices in fact.
>
> And all the many variants of BIAS_ELIGIBLE that I've tried so far only
> manage to murder the high end while simultaneously not actually helping
> the low end -- so that's a complete write off.
>
>
> By far the sanest option so far is PLACE_SLEEPER -- and that is very
> much not a nice option either :-(

2023-10-10 08:03:16

by Peter Zijlstra

Subject: Re: [PATCH 00/15] sched: EEVDF and latency-nice and/or slice-attr

On Mon, Oct 09, 2023 at 07:51:03PM -0500, Youssef Esmat wrote:

> > Playing around with it a little:
> >
> > EEVDF EVDF
> >
> > slice 30000000 slice 30000000
> > # Min Latencies: 00014 # Min Latencies: 00048
> > # Avg Latencies: 00692 # Avg Latencies: 188239
> > # Max Latencies: 94633 # Max Latencies: 961241
> >
> > slice 3000000 slice 3000000
> > # Min Latencies: 00054 # Min Latencies: 00055
> > # Avg Latencies: 00522 # Avg Latencies: 00673
> > # Max Latencies: 41475 # Max Latencies: 13297
> >
> > slice 300000 slice 300000
> > # Min Latencies: 00018 # Min Latencies: 00024
> > # Avg Latencies: 00344 # Avg Latencies: 00056
> > # Max Latencies: 20061 # Max Latencies: 00860
> >
>
> Thanks for sharing. Which workload was used to generate these numbers?

This is hackbench vs cyclictest, where cyclictest gets a custom slice
set: the big slice is 10 * normal, the middle slice is normal (equal to
hackbench's), and the short slice is normal / 10.

2023-10-10 08:09:33

by Peter Zijlstra

Subject: Re: [PATCH 00/15] sched: EEVDF and latency-nice and/or slice-attr

On Thu, Oct 05, 2023 at 01:23:14PM -0500, Youssef Esmat wrote:
> On Thu, Oct 5, 2023 at 7:06 AM Peter Zijlstra <[email protected]> wrote:

> > BAaaBCccccccccBBBAaaBBBAaaBB BBAaaBBBAaaBBBAaaBCccccccccB
> >
> >
> >
> > As I wrote before; EVDF has worse lag bounds, but this is not
> > insurmountable. The biggest problem that I can see is that of wakeup
> > preemption. Currently we allow to preempt when 'current' has reached V
> > (RUN_TO_PARITY in pick_eevdf()).
> >
> > With these rules, when EEVDF schedules C (our large slice task) at t=10
> > above, it is only a little behind C and can be reaily preempted after
> > about 2 time units.
> >
> > However, EVDF will delay scheduling C until much later, see how A and B
> > walk far ahead of V until t=36. Only when will we pick C. But this means
> > that we're firmly stuck with C for at least 11 time units. A newly
> > placed task will be around V and will have no chance to preempt.
> >
>
> Thank you for the detailed analysis! I am still in the process of
> digesting everything.
> I do have a quick question, this will only be the case if we adjust
> C's runtime without adjusting nice value, correct? So it does not
> currently apply to the submitted code where the only way to change the
> deadline is to also change the nice value and thus how fast/slow
> vruntime accumulates. In other words without the sched_runtime
> patch[1] we should not run into this scenario, correct?
>
> [1] https://lore.kernel.org/lkml/[email protected]/


Much harder to run into it, but you can still hit it using nice.

d_i = v_i + r_i / w_i

So for very light tasks, the effective v_deadline goes out lots.
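(Concretely, assuming the usual nice-to-weight table and nice-0 scaled
virtual time: a 3ms request at nice 0 (weight 1024) pushes the virtual
deadline out by ~3ms, while the same request at nice 19 (weight 15)
pushes it out by 3ms * 1024/15 ~= 205ms.)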

And as I wrote yesterday, by delaying scheduling (far) past 0-lag you
effectively open up a denial of service window, since new tasks are
placed around 0-lag.

Now, nobody will likely care much if very light tasks get horrific
treatment, but still, it's there.

2023-10-16 16:57:17

by Peter Zijlstra

Subject: Re: [PATCH 00/15] sched: EEVDF and latency-nice and/or slice-attr


Sorry, I seem to have forgotten to reply to this part...

On Mon, Oct 09, 2023 at 07:51:03PM -0500, Youssef Esmat wrote:

> I think looking at the sched latency numbers alone does not show the
> complete picture. I ran the same input latency test again and tried to
> capture some of these numbers for the chrome processes.
>
> EEVDF 1.5ms slice:
>
> Input latency test result: 226ms
> perf sched latency:
> switches: 1,084,694
> avg: 1.139 ms
> max: 408.397 ms
>
> EEVDF 6.0ms slice:
>
> Input latency test result: 178ms
> perf sched latency:
> switches: 892,306
> avg: 1.145 ms
> max: 354.344 ms

> For our scenario, it is very expensive to interrupt UI threads. It
> will increase the input latency significantly. Lowering the scheduling
> latency at the cost of switching out important threads can be very
> detrimental in this workload. UI and input threads run with a nice
> value of -8.

> That said, this might not be beneficial for all workloads, and we are
> still trying our other workloads out.

Right, this seems to suggest something on your critical path (you should
trace that) has more than 3ms of compute in a single activation.

Basically this means chrome is fairly fat on this critical path. But it
seems you know about that.

Anyway, once you know the 95th-percentile length of the activations on
your critical path, you can indeed set your slice to that. This should
be readily available from trace data.
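
(For instance, one way to get that number: perf sched record the workload,
then perf sched timehist and take the 95th percentile of the 'run time'
column for the threads on the critical path.)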