2023-03-28 11:09:02

by Peter Zijlstra

Subject: [PATCH 14/17] sched/eevdf: Better handle mixed slice length

In the case where (due to latency-nice) there are different request
sizes in the tree, the smaller requests tend to be dominated by the
larger. Also note how the EEVDF lag limits are based on r_max.

Therefore, add a heuristic that, for the mixed request size case, moves
smaller requests to placement strategy #2, which ensures they are
immediately eligible and, due to their smaller (virtual) deadline, will
cause preemption.

NOTE: this relies on update_entity_lag() to impose lag limits above
a single slice.

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
---
kernel/sched/fair.c | 14 ++++++++++++++
kernel/sched/features.h | 1 +
kernel/sched/sched.h | 1 +
3 files changed, 16 insertions(+)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -616,6 +616,7 @@ avg_vruntime_add(struct cfs_rq *cfs_rq,
s64 key = entity_key(cfs_rq, se);

cfs_rq->avg_vruntime += key * weight;
+ cfs_rq->avg_slice += se->slice * weight;
cfs_rq->avg_load += weight;
}

@@ -626,6 +627,7 @@ avg_vruntime_sub(struct cfs_rq *cfs_rq,
s64 key = entity_key(cfs_rq, se);

cfs_rq->avg_vruntime -= key * weight;
+ cfs_rq->avg_slice -= se->slice * weight;
cfs_rq->avg_load -= weight;
}

@@ -4832,6 +4834,18 @@ place_entity(struct cfs_rq *cfs_rq, stru
lag = se->vlag;

/*
+ * For latency sensitive tasks; those that have a shorter than
+ * average slice and do not fully consume the slice, transition
+ * to EEVDF placement strategy #2.
+ */
+ if (sched_feat(PLACE_FUDGE) &&
+ cfs_rq->avg_slice > se->slice * cfs_rq->avg_load) {
+ lag += vslice;
+ if (lag > 0)
+ lag = 0;
+ }
+
+ /*
* If we want to place a task and preserve lag, we have to
* consider the effect of the new entity on the weighted
* average and compensate for this, otherwise lag can quickly
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -5,6 +5,7 @@
* sleep+wake cycles. EEVDF placement strategy #1, #2 if disabled.
*/
SCHED_FEAT(PLACE_LAG, true)
+SCHED_FEAT(PLACE_FUDGE, true)
SCHED_FEAT(PLACE_DEADLINE_INITIAL, true)

/*
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -559,6 +559,7 @@ struct cfs_rq {
unsigned int idle_h_nr_running; /* SCHED_IDLE */

s64 avg_vruntime;
+ u64 avg_slice;
u64 avg_load;

u64 exec_clock;
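
For readers skimming the diff, here is a standalone sketch (not kernel code; the struct layouts and slice values are made up for illustration) of the bookkeeping the patch adds: avg_slice accumulates slice * weight and avg_load accumulates weight, so the load-weighted average slice is avg_slice / avg_load, and the PLACE_FUDGE condition is that comparison with the division multiplied out.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Sketch only: rq_avg stands in for the cfs_rq fields the patch adds,
 * entity for the per-se slice and load weight.
 */
struct entity { uint64_t slice, weight; };
struct rq_avg { uint64_t avg_slice, avg_load; };

static void avg_add(struct rq_avg *rq, const struct entity *se)
{
	rq->avg_slice += se->slice * se->weight;
	rq->avg_load  += se->weight;
}

static void avg_sub(struct rq_avg *rq, const struct entity *se)
{
	rq->avg_slice -= se->slice * se->weight;
	rq->avg_load  -= se->weight;
}

/* equivalent to: se->slice < avg_slice / avg_load, without dividing */
static int shorter_than_avg(const struct rq_avg *rq, const struct entity *se)
{
	return rq->avg_slice > se->slice * rq->avg_load;
}
```

With a 9ms and a 3ms request at equal weight the weighted average is 6ms, so only the 3ms request takes the strategy #2 path; once the tree holds a single slice length the strict inequality never fires, so a uniform tree is unaffected.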



2023-03-31 15:40:18

by Vincent Guittot

Subject: Re: [PATCH 14/17] sched/eevdf: Better handle mixed slice length

On Tue, 28 Mar 2023 at 13:06, Peter Zijlstra <[email protected]> wrote:
>
> In the case where (due to latency-nice) there are different request
> sizes in the tree, the smaller requests tend to be dominated by the
> larger. Also note how the EEVDF lag limits are based on r_max.
>
> Therefore; add a heuristic that for the mixed request size case, moves
> smaller requests to placement strategy #2 which ensures they're
> immidiately eligible and and due to their smaller (virtual) deadline
> will cause preemption.
>
> NOTE: this relies on update_entity_lag() to impose lag limits above
> a single slice.
>
> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
> ---
> kernel/sched/fair.c | 14 ++++++++++++++
> kernel/sched/features.h | 1 +
> kernel/sched/sched.h | 1 +
> 3 files changed, 16 insertions(+)
>
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -616,6 +616,7 @@ avg_vruntime_add(struct cfs_rq *cfs_rq,
> s64 key = entity_key(cfs_rq, se);
>
> cfs_rq->avg_vruntime += key * weight;
> + cfs_rq->avg_slice += se->slice * weight;
> cfs_rq->avg_load += weight;
> }
>
> @@ -626,6 +627,7 @@ avg_vruntime_sub(struct cfs_rq *cfs_rq,
> s64 key = entity_key(cfs_rq, se);
>
> cfs_rq->avg_vruntime -= key * weight;
> + cfs_rq->avg_slice -= se->slice * weight;
> cfs_rq->avg_load -= weight;
> }
>
> @@ -4832,6 +4834,18 @@ place_entity(struct cfs_rq *cfs_rq, stru
> lag = se->vlag;
>
> /*
> + * For latency sensitive tasks; those that have a shorter than
> + * average slice and do not fully consume the slice, transition
> + * to EEVDF placement strategy #2.
> + */
> + if (sched_feat(PLACE_FUDGE) &&
> + cfs_rq->avg_slice > se->slice * cfs_rq->avg_load) {
> + lag += vslice;
> + if (lag > 0)
> + lag = 0;

By using different lag policies for tasks, doesn't this create
unfairness between tasks ?

I wanted to stress this situation with a simple use case, but it seems
that even without changing the slice there is a fairness problem:

Task A always runs.
Task B loops on: running 1ms then sleeping 1ms.
Both have default nice and latency-nice priority.
Each task should get around 50% of the CPU time.

The fairness is OK with tip/sched/core, but with eevdf Task B only
gets around 30%.

I haven't identified the problem so far.


> + }
> +
> + /*
> * If we want to place a task and preserve lag, we have to
> * consider the effect of the new entity on the weighted
> * average and compensate for this, otherwise lag can quickly
> --- a/kernel/sched/features.h
> +++ b/kernel/sched/features.h
> @@ -5,6 +5,7 @@
> * sleep+wake cycles. EEVDF placement strategy #1, #2 if disabled.
> */
> SCHED_FEAT(PLACE_LAG, true)
> +SCHED_FEAT(PLACE_FUDGE, true)
> SCHED_FEAT(PLACE_DEADLINE_INITIAL, true)
>
> /*
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -559,6 +559,7 @@ struct cfs_rq {
> unsigned int idle_h_nr_running; /* SCHED_IDLE */
>
> s64 avg_vruntime;
> + u64 avg_slice;
> u64 avg_load;
>
> u64 exec_clock;
>
>

2023-04-02 02:54:38

by Mike Galbraith

Subject: Re: [PATCH 14/17] sched/eevdf: Better handle mixed slice length

On Sun, 2023-04-02 at 07:23 +0800, Hillf Danton wrote:
> On 31 Mar 2023 17:26:51 +0200 Vincent Guittot <[email protected]>
> >
> > I wanted to stress this situation with a simple use case but it seems
> > that even without changing the slice, there is a fairness problem:
> >
> > Task A always run
> > Task B loops on : running 1ms then sleeping 1ms
> > default nice and latency nice prio bot both
> > each task should get around 50% of the time.
> >
> > The fairness is ok with tip/sched/core
> > but with eevdf, Task B only gets around 30%
>
Convincing evidence of a glitch in wakeup preemption.

If you turn on PLACE_BONUS, it'll mimic FAIR_SLEEPERS.. but if you then
do some testing, you'll probably turn it right back off.

The 50/50 split in the current code isn't really any more fair: as soon
as you leave the tiny bubble of fairness, it's not the least bit fair.
Nor is that tiny bubble all rainbows and unicorns; it brought with it
benchmark wins and losses, like everything that changes more than
comments, and its price is service latency variance.

The short-term split doesn't really mean all that much; some things
will like the current fair-bubble better, some will prefer eevdf's
virtual deadline math and its less spiky service. We'll see.

I'm kinda hoping eevdf works out, FAIR_SLEEPERS is quite annoying to
squabble with.

-Mike

2023-04-04 09:34:52

by Peter Zijlstra

Subject: Re: [PATCH 14/17] sched/eevdf: Better handle mixed slice length

On Fri, Mar 31, 2023 at 05:26:51PM +0200, Vincent Guittot wrote:
> On Tue, 28 Mar 2023 at 13:06, Peter Zijlstra <[email protected]> wrote:
> >
> > In the case where (due to latency-nice) there are different request
> > sizes in the tree, the smaller requests tend to be dominated by the
> > larger. Also note how the EEVDF lag limits are based on r_max.
> >
> > Therefore; add a heuristic that for the mixed request size case, moves
> > smaller requests to placement strategy #2 which ensures they're
> > immidiately eligible and and due to their smaller (virtual) deadline
> > will cause preemption.
> >
> > NOTE: this relies on update_entity_lag() to impose lag limits above
> > a single slice.
> >
> > Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
> > ---
> > kernel/sched/fair.c | 14 ++++++++++++++
> > kernel/sched/features.h | 1 +
> > kernel/sched/sched.h | 1 +
> > 3 files changed, 16 insertions(+)
> >
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -616,6 +616,7 @@ avg_vruntime_add(struct cfs_rq *cfs_rq,
> > s64 key = entity_key(cfs_rq, se);
> >
> > cfs_rq->avg_vruntime += key * weight;
> > + cfs_rq->avg_slice += se->slice * weight;
> > cfs_rq->avg_load += weight;
> > }
> >
> > @@ -626,6 +627,7 @@ avg_vruntime_sub(struct cfs_rq *cfs_rq,
> > s64 key = entity_key(cfs_rq, se);
> >
> > cfs_rq->avg_vruntime -= key * weight;
> > + cfs_rq->avg_slice -= se->slice * weight;
> > cfs_rq->avg_load -= weight;
> > }
> >
> > @@ -4832,6 +4834,18 @@ place_entity(struct cfs_rq *cfs_rq, stru
> > lag = se->vlag;
> >
> > /*
> > + * For latency sensitive tasks; those that have a shorter than
> > + * average slice and do not fully consume the slice, transition
> > + * to EEVDF placement strategy #2.
> > + */
> > + if (sched_feat(PLACE_FUDGE) &&
> > + cfs_rq->avg_slice > se->slice * cfs_rq->avg_load) {
> > + lag += vslice;
> > + if (lag > 0)
> > + lag = 0;
>
> By using different lag policies for tasks, doesn't this create
> unfairness between tasks ?

Possibly, I've just not managed to trigger it yet -- if it is an issue
it can be fixed by ensuring we don't place the entity before its
previous vruntime, just like the sleeper hack later on.
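
A minimal sketch of what such a guard could look like (hypothetical; this is not the actual patch, and the kernel's place_entity() carries more state): compute the lag-adjusted placement, then refuse to move the entity left of its previous vruntime, so strategy #2 placement cannot hand out extra service.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical guard, sketch only: vruntime is placed relative to the
 * queue's average, then clamped so it never goes behind where the
 * entity previously was.  The subtract-and-cast comparison is the
 * usual wrap-safe idiom for u64 virtual times.
 */
static uint64_t place_guarded(uint64_t cfs_avg_vruntime, int64_t lag,
			      uint64_t prev_vruntime)
{
	uint64_t vruntime = cfs_avg_vruntime - (uint64_t)lag;

	if ((int64_t)(vruntime - prev_vruntime) < 0)	/* vruntime < prev */
		vruntime = prev_vruntime;
	return vruntime;
}
```

With positive lag pulling the placement below the previous vruntime the clamp wins; with negative lag the entity is placed further right and the clamp is a no-op.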

> I wanted to stress this situation with a simple use case but it seems
> that even without changing the slice, there is a fairness problem:
>
> Task A always run
> Task B loops on : running 1ms then sleeping 1ms
> default nice and latency nice prio bot both
> each task should get around 50% of the time.
>
> The fairness is ok with tip/sched/core
> but with eevdf, Task B only gets around 30%
>
> I haven't identified the problem so far

Heh, this is actually the correct behaviour. If you have a u=1 and a
u=.5 task, you should distribute time on a 2:1 basis, i.e. 67% vs 33%.

CFS has this sleeper bonus hack that makes it 50% vs 50% but that is
strictly not correct -- although it does help a number of weird
benchmarks.
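
Vincent's observed ~30% matches that arithmetic. A back-of-the-envelope model (a sketch, assuming one CPU, equal weights, and an exact 50/50 split whenever both tasks are runnable):

```c
#include <assert.h>
#include <math.h>

/*
 * Steady-state CPU share of a task that runs run_ms then sleeps
 * sleep_ms, competing at equal weight with an always-runnable hog on
 * one CPU.  While the sleeper is runnable the two split the CPU
 * 50/50, so accumulating run_ms of service takes 2*run_ms of wall
 * time; the hog then runs alone for sleep_ms.
 */
static double sleeper_share(double run_ms, double sleep_ms)
{
	double cycle = 2.0 * run_ms + sleep_ms;	/* contended + solo-hog */
	return run_ms / cycle;
}
```

For the 1ms-run/1ms-sleep case the cycle is 3ms with the sleeper running 1ms of it: 33% vs 67%, exactly the u=0.5 : u=1 ratio.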

2023-04-04 13:53:08

by Joel Fernandes

Subject: Re: [PATCH 14/17] sched/eevdf: Better handle mixed slice length

On Tue, Apr 04, 2023 at 11:29:36AM +0200, Peter Zijlstra wrote:
> On Fri, Mar 31, 2023 at 05:26:51PM +0200, Vincent Guittot wrote:
> > On Tue, 28 Mar 2023 at 13:06, Peter Zijlstra <[email protected]> wrote:
> > >
> > > In the case where (due to latency-nice) there are different request
> > > sizes in the tree, the smaller requests tend to be dominated by the
> > > larger. Also note how the EEVDF lag limits are based on r_max.
> > >
> > > Therefore; add a heuristic that for the mixed request size case, moves
> > > smaller requests to placement strategy #2 which ensures they're
> > > immidiately eligible and and due to their smaller (virtual) deadline
> > > will cause preemption.
> > >
> > > NOTE: this relies on update_entity_lag() to impose lag limits above
> > > a single slice.
> > >
> > > Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
> > > ---
> > > kernel/sched/fair.c | 14 ++++++++++++++
> > > kernel/sched/features.h | 1 +
> > > kernel/sched/sched.h | 1 +
> > > 3 files changed, 16 insertions(+)
> > >
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -616,6 +616,7 @@ avg_vruntime_add(struct cfs_rq *cfs_rq,
> > > s64 key = entity_key(cfs_rq, se);
> > >
> > > cfs_rq->avg_vruntime += key * weight;
> > > + cfs_rq->avg_slice += se->slice * weight;
> > > cfs_rq->avg_load += weight;
> > > }
> > >
> > > @@ -626,6 +627,7 @@ avg_vruntime_sub(struct cfs_rq *cfs_rq,
> > > s64 key = entity_key(cfs_rq, se);
> > >
> > > cfs_rq->avg_vruntime -= key * weight;
> > > + cfs_rq->avg_slice -= se->slice * weight;
> > > cfs_rq->avg_load -= weight;
> > > }
> > >
> > > @@ -4832,6 +4834,18 @@ place_entity(struct cfs_rq *cfs_rq, stru
> > > lag = se->vlag;
> > >
> > > /*
> > > + * For latency sensitive tasks; those that have a shorter than
> > > + * average slice and do not fully consume the slice, transition
> > > + * to EEVDF placement strategy #2.
> > > + */
> > > + if (sched_feat(PLACE_FUDGE) &&
> > > + cfs_rq->avg_slice > se->slice * cfs_rq->avg_load) {
> > > + lag += vslice;
> > > + if (lag > 0)
> > > + lag = 0;
> >
> > By using different lag policies for tasks, doesn't this create
> > unfairness between tasks ?
>
> Possibly, I've just not managed to trigger it yet -- if it is an issue
> it can be fixed by ensuring we don't place the entity before its
> previous vruntime just like the sleeper hack later on.
>
> > I wanted to stress this situation with a simple use case but it seems
> > that even without changing the slice, there is a fairness problem:
> >
> > Task A always run
> > Task B loops on : running 1ms then sleeping 1ms
> > default nice and latency nice prio bot both
> > each task should get around 50% of the time.
> >
> > The fairness is ok with tip/sched/core
> > but with eevdf, Task B only gets around 30%
> >
> > I haven't identified the problem so far
>
> Heh, this is actually the correct behaviour. If you have a u=1 and a
> u=.5 task, you should distribute time on a 2:1 basis, eg. 67% vs 33%.

Splitting like that sounds like starvation of the sleeper to me. If
something sleeps a lot, it will get even less CPU time on average than
it would if there were no contention from the u=1 task.

And cgroups will be even weirder than they already are in such a
world; two different containers will not get CPU time distributed
properly -- say, if tasks in one container sleep a lot and tasks in
the other container are CPU bound.

thanks,

- Joel

2023-04-05 05:46:27

by Mike Galbraith

Subject: Re: [PATCH 14/17] sched/eevdf: Better handle mixed slice length

On Tue, 2023-04-04 at 13:50 +0000, Joel Fernandes wrote:
> On Tue, Apr 04, 2023 at 11:29:36AM +0200, Peter Zijlstra wrote:
> > On Fri, Mar 31, 2023 at 05:26:51PM +0200, Vincent Guittot wrote:
> >
> > >
> > > Task A always run
> > > Task B loops on : running 1ms then sleeping 1ms
> > > default nice and latency nice prio bot both
> > > each task should get around 50% of the time.
> > >
> > > The fairness is ok with tip/sched/core
> > > but with eevdf, Task B only gets around 30%
> > >
> > > I haven't identified the problem so far
> >
> > Heh, this is actually the correct behaviour. If you have a u=1 and a
> > u=.5 task, you should distribute time on a 2:1 basis, eg. 67% vs 33%.
>
> Splitting like that sounds like starvation of the sleeper to me. If something
> sleeps a lot, it will get even less CPU time on an average than it would if
> there was no contention from the u=1 task.
>
> And also CGroups will be even more weird than it already is in such a world,
> 2 different containers will not get CPU time distributed properly- say if
> tasks in one container sleep a lot and tasks in another container are CPU
> bound.

Let's take a quick peek at some group distribution numbers.

Start tbench and massive_intr in their own VT (autogroup), then in
another: sleep 300; killall top massive_intr tbench_srv tbench.

(Caveman method because perf's refusing to handle fast switchers well
for me.. top's plenty good enough for this anyway, and less intrusive.)

massive_intr runs 8ms, sleeps 1ms, and wants 88.8% of 8 runqueues.
The tbench buddy pairs want only a tad more CPU, 100% between them,
but switch orders of magnitude more frequently. Very dissimilar breeds
of hog.

                        accrued   of 2400s  vs master
master.today
team massive_intr      1120.50s      .466      1.000
team tbench            1256.13s      .523      1.000

+eevdf
team massive_intr      1071.94s      .446       .956
team tbench            1301.56s      .542      1.036

There is of course a distribution delta.. but was it meaningful?

Between a mostly idle but kinda noisy GUI perturbing things and, more
importantly, neither load having been manually distributed and pinned,
both schedulers came out pretty good, and both a tad shy of.. perfect
is the enemy of good.

Raw numbers below in case my mouse mucked up feeding of numbers to bc
(blame the inanimate, they can't do a damn thing about it).

6.3.0.g148341f-master
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ P COMMAND
5641 root 20 0 2564 640 640 R 50.33 0.004 2:17.19 5 massive_intr
5636 root 20 0 2564 640 640 S 49.00 0.004 2:20.05 5 massive_intr
5647 root 20 0 2564 640 640 R 48.67 0.004 2:21.85 6 massive_intr
5640 root 20 0 2564 640 640 R 48.00 0.004 2:21.13 6 massive_intr
5645 root 20 0 2564 640 640 R 47.67 0.004 2:18.25 5 massive_intr
5638 root 20 0 2564 640 640 R 46.67 0.004 2:22.39 2 massive_intr
5634 root 20 0 2564 640 640 R 45.00 0.004 2:18.93 4 massive_intr
5643 root 20 0 2564 640 640 R 44.00 0.004 2:20.71 7 massive_intr
5639 root 20 0 23468 1664 1536 R 29.00 0.010 1:22.31 3 tbench
5644 root 20 0 23468 1792 1664 R 28.67 0.011 1:22.32 3 tbench
5637 root 20 0 23468 1664 1536 S 28.00 0.010 1:22.75 5 tbench
5631 root 20 0 23468 1792 1664 R 27.00 0.011 1:21.47 4 tbench
5632 root 20 0 23468 1536 1408 R 27.00 0.010 1:21.78 0 tbench
5653 root 20 0 6748 896 768 S 26.67 0.006 1:15.26 3 tbench_srv
5633 root 20 0 23468 1792 1664 R 26.33 0.011 1:22.53 0 tbench
5635 root 20 0 23468 1920 1792 R 26.33 0.012 1:20.72 7 tbench
5642 root 20 0 23468 1920 1792 R 26.00 0.012 1:21.73 2 tbench
5650 root 20 0 6748 768 768 R 25.67 0.005 1:15.71 1 tbench_srv
5652 root 20 0 6748 768 768 S 25.67 0.005 1:15.71 3 tbench_srv
5646 root 20 0 6748 768 768 S 25.33 0.005 1:14.97 4 tbench_srv
5648 root 20 0 6748 896 768 S 25.00 0.006 1:14.66 0 tbench_srv
5651 root 20 0 6748 896 768 S 24.67 0.006 1:14.79 2 tbench_srv
5654 root 20 0 6748 768 768 R 24.33 0.005 1:15.47 0 tbench_srv
5649 root 20 0 6748 768 768 R 24.00 0.005 1:13.95 7 tbench_srv

6.3.0.g148341f-master-eevdf
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ P COMMAND
10561 root 20 0 2564 768 640 R 49.83 0.005 2:14.86 3 massive_intr
10562 root 20 0 2564 768 640 R 49.50 0.005 2:14.00 3 massive_intr
10564 root 20 0 2564 896 768 R 49.50 0.006 2:14.11 6 massive_intr
10559 root 20 0 2564 768 640 R 47.84 0.005 2:14.03 2 massive_intr
10560 root 20 0 2564 768 640 R 45.51 0.005 2:13.92 7 massive_intr
10557 root 20 0 2564 896 768 R 44.85 0.006 2:13.59 7 massive_intr
10563 root 20 0 2564 896 768 R 44.85 0.006 2:13.53 6 massive_intr
10558 root 20 0 2564 768 640 R 43.52 0.005 2:13.90 2 massive_intr
10577 root 20 0 23468 1920 1792 R 35.22 0.012 1:37.06 0 tbench
10574 root 20 0 23468 1920 1792 R 32.23 0.012 1:32.89 4 tbench
10580 root 20 0 23468 1920 1792 R 29.57 0.012 1:34.95 0 tbench
10575 root 20 0 23468 1792 1664 R 29.24 0.011 1:31.66 4 tbench
10576 root 20 0 23468 1792 1664 S 28.57 0.011 1:34.55 5 tbench
10573 root 20 0 23468 1792 1664 R 28.24 0.011 1:33.17 5 tbench
10578 root 20 0 23468 1920 1792 S 28.24 0.012 1:33.97 1 tbench
10579 root 20 0 23468 1920 1792 R 28.24 0.012 1:36.09 1 tbench
10587 root 20 0 6748 768 640 S 26.91 0.005 1:09.45 0 tbench_srv
10582 root 20 0 6748 768 640 R 24.25 0.005 1:08.19 4 tbench_srv
10588 root 20 0 6748 640 640 R 22.59 0.004 1:09.15 0 tbench_srv
10583 root 20 0 6748 640 640 R 21.93 0.004 1:07.93 4 tbench_srv
10586 root 20 0 6748 640 640 S 21.59 0.004 1:07.92 1 tbench_srv
10581 root 20 0 6748 640 640 S 21.26 0.004 1:07.08 5 tbench_srv
10585 root 20 0 6748 640 640 R 21.26 0.004 1:08.89 5 tbench_srv
10584 root 20 0 6748 768 640 S 20.93 0.005 1:08.61 1 tbench_srv


2023-04-05 08:40:03

by Peter Zijlstra

Subject: Re: [PATCH 14/17] sched/eevdf: Better handle mixed slice length

On Tue, Apr 04, 2023 at 01:50:50PM +0000, Joel Fernandes wrote:
> On Tue, Apr 04, 2023 at 11:29:36AM +0200, Peter Zijlstra wrote:

> > Heh, this is actually the correct behaviour. If you have a u=1 and a
> > u=.5 task, you should distribute time on a 2:1 basis, eg. 67% vs 33%.
>
> Splitting like that sounds like starvation of the sleeper to me. If something
> sleeps a lot, it will get even less CPU time on an average than it would if
> there was no contention from the u=1 task.

No, sleeping, by definition, means you're not contending for the CPU.
What CFS does, giving them a little boost, is strictly yuck and messes
with latency -- because suddenly you have a task that said it wasn't
competing appear as if it were, but you didn't run it (how could you,
it wasn't there to run) -- yet it still needs to catch up.

The reason it does that, is mostly because at the time we didn't want to
do the whole lag thing -- it's somewhat heavy on the u64 mults and 32bit
computing was still a thing :/ So hacks happened.

That said; I'm starting to regret not pushing the EEVDF thing harder
back in 2010 when I first wrote it :/

> And also CGroups will be even more weird than it already is in such a world,
> 2 different containers will not get CPU time distributed properly- say if
> tasks in one container sleep a lot and tasks in another container are CPU
> bound.

Cgroups are an abomination anyway :-) /me runs like hell. But no, I
don't actually expect too much trouble there.

Or rather, as per the above, time distribution is now more proper than
it was :-)

2023-04-05 20:07:19

by Joel Fernandes

Subject: Re: [PATCH 14/17] sched/eevdf: Better handle mixed slice length

On Wed, Apr 5, 2023 at 4:36 AM Peter Zijlstra <[email protected]> wrote:
>
> On Tue, Apr 04, 2023 at 01:50:50PM +0000, Joel Fernandes wrote:
> > On Tue, Apr 04, 2023 at 11:29:36AM +0200, Peter Zijlstra wrote:
>
> > > Heh, this is actually the correct behaviour. If you have a u=1 and a
> > > u=.5 task, you should distribute time on a 2:1 basis, eg. 67% vs 33%.
> >
> > Splitting like that sounds like starvation of the sleeper to me. If something
> > sleeps a lot, it will get even less CPU time on an average than it would if
> > there was no contention from the u=1 task.
>
> No, sleeping, per definition, means you're not contending for CPU. What
> CFS does, giving them a little boost, is strictly yuck and messes with
> latency -- because suddenly you have a task that said it wasn't
> competing appear as if it were, but you didn't run it (how could you, it
> wasn't there to run) -- but it still needs to catch up.
>
> The reason it does that, is mostly because at the time we didn't want to
> do the whole lag thing -- it's somewhat heavy on the u64 mults and 32bit
> computing was still a thing :/ So hacks happened.

Also, you have the whole boosting of tasks that sleep a lot with CFS,
right? Like a task handling user input: it sleeps a lot, but when it
wakes up it gets higher dynamic priority because its vruntime did not
advance. I guess EEVDF also gets you the same thing, but still messes
with the CPU usage?

> That said; I'm starting to regret not pushing the EEVDF thing harder
> back in 2010 when I first wrote it :/
>
> > And also CGroups will be even more weird than it already is in such a world,
> > 2 different containers will not get CPU time distributed properly- say if
> > tasks in one container sleep a lot and tasks in another container are CPU
> > bound.
>
> Cgroups are an abomination anyway :-) /me runs like hell. But no, I
> don't actually expect too much trouble there.

So, with two equally weighted containers, if one has a task that
sleeps 50% of the time and the other has a 100% CPU-bound task, the
sleeper will only run 33% of the time? I can see people running
containers having a problem with that (a customer running one
container gets less CPU than the other). Sorry if I missed something.

But yeah I do find the whole EEVDF idea interesting but I admit I have
to research it more.

- Joel

2023-04-14 11:28:57

by Phil Auld

Subject: Re: [PATCH 14/17] sched/eevdf: Better handle mixed slice length

On Wed, Apr 05, 2023 at 04:05:55PM -0400 Joel Fernandes wrote:
> On Wed, Apr 5, 2023 at 4:36 AM Peter Zijlstra <[email protected]> wrote:
> >
> > On Tue, Apr 04, 2023 at 01:50:50PM +0000, Joel Fernandes wrote:
> > > On Tue, Apr 04, 2023 at 11:29:36AM +0200, Peter Zijlstra wrote:
> >
> > > > Heh, this is actually the correct behaviour. If you have a u=1 and a
> > > > u=.5 task, you should distribute time on a 2:1 basis, eg. 67% vs 33%.
> > >
> > > Splitting like that sounds like starvation of the sleeper to me. If something
> > > sleeps a lot, it will get even less CPU time on an average than it would if
> > > there was no contention from the u=1 task.
> >
> > No, sleeping, per definition, means you're not contending for CPU. What
> > CFS does, giving them a little boost, is strictly yuck and messes with
> > latency -- because suddenly you have a task that said it wasn't
> > competing appear as if it were, but you didn't run it (how could you, it
> > wasn't there to run) -- but it still needs to catch up.
> >
> > The reason it does that, is mostly because at the time we didn't want to
> > do the whole lag thing -- it's somewhat heavy on the u64 mults and 32bit
> > computing was still a thing :/ So hacks happened.
>
> Also you have the whole "boost tasks" that sleep a lot with CFS right?
> Like a task handling user input sleeps a lot, but when it wakes up,
> it gets higher dynamic priority as its vruntime did not advance. I
> guess EEVDF also gets you the same thing but still messes with the CPU
> usage?
>
> > That said; I'm starting to regret not pushing the EEVDF thing harder
> > back in 2010 when I first wrote it :/
> >
> > > And also CGroups will be even more weird than it already is in such a world,
> > > 2 different containers will not get CPU time distributed properly- say if
> > > tasks in one container sleep a lot and tasks in another container are CPU
> > > bound.
> >
> > Cgroups are an abomination anyway :-) /me runs like hell. But no, I
> > don't actually expect too much trouble there.
>
> So, with 2 equally weighted containers, if one has a task that sleeps
> 50% of the time, and another has a 100% task, then the sleeper will
> only run 33% of the time? I can see people running containers having a
> problem with that (a customer running one container gets less CPU than
> the other.). Sorry if I missed something.
>

But the 50% sleeper is _asking_ for less CPU. Doing 50% for each would
mean that when the sleeper task was awake it always ran, always won,
to the exclusion of anyone else. (Assuming 1 CPU...)
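
To make that concrete (a sketch, same one-CPU assumption as above): the only schedule that gets the 1ms-run/1ms-sleep task to 50% is one where it runs exclusively the moment it wakes, starving the hog for that whole interval.

```c
#include <assert.h>

/* Shares under a schedule where the sleeper runs exclusively while
 * awake: a cycle is run_ms for the sleeper (hog fully starved),
 * then sleep_ms for the hog alone. */
struct split { double sleeper, hog, hog_while_sleeper_awake; };

static struct split exclusive_when_awake(double run_ms, double sleep_ms)
{
	struct split s;
	double cycle = run_ms + sleep_ms;

	s.sleeper = run_ms / cycle;
	s.hog = sleep_ms / cycle;
	s.hog_while_sleeper_awake = 0.0;	/* always loses while sleeper is up */
	return s;
}
```

With 1ms/1ms this is the 50/50 outcome, bought at the price of the hog getting zero service whenever the sleeper is runnable.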

Cheers,
Phil

> But yeah I do find the whole EEVDF idea interesting but I admit I have
> to research it more.
>
> - Joel
>

--

2023-04-16 05:21:06

by Joel Fernandes

Subject: Re: [PATCH 14/17] sched/eevdf: Better handle mixed slice length



> On Apr 14, 2023, at 1:18 PM, Phil Auld <[email protected]> wrote:
>
> On Wed, Apr 05, 2023 at 04:05:55PM -0400 Joel Fernandes wrote:
>>> On Wed, Apr 5, 2023 at 4:36 AM Peter Zijlstra <[email protected]> wrote:
>>>
>>> On Tue, Apr 04, 2023 at 01:50:50PM +0000, Joel Fernandes wrote:
>>>>> On Tue, Apr 04, 2023 at 11:29:36AM +0200, Peter Zijlstra wrote:
>>>
>>>>> Heh, this is actually the correct behaviour. If you have a u=1 and a
>>>>> u=.5 task, you should distribute time on a 2:1 basis, eg. 67% vs 33%.
>>>>
>>>> Splitting like that sounds like starvation of the sleeper to me. If something
>>>> sleeps a lot, it will get even less CPU time on an average than it would if
>>>> there was no contention from the u=1 task.
>>>
>>> No, sleeping, per definition, means you're not contending for CPU. What
>>> CFS does, giving them a little boost, is strictly yuck and messes with
>>> latency -- because suddenly you have a task that said it wasn't
>>> competing appear as if it were, but you didn't run it (how could you, it
>>> wasn't there to run) -- but it still needs to catch up.
>>>
>>> The reason it does that, is mostly because at the time we didn't want to
>>> do the whole lag thing -- it's somewhat heavy on the u64 mults and 32bit
>>> computing was still a thing :/ So hacks happened.
>>
>> Also you have the whole "boost tasks" that sleep a lot with CFS right?
>> Like a task handling user input sleeps a lot, but when it wakes up,
>> it gets higher dynamic priority as its vruntime did not advance. I
>> guess EEVDF also gets you the same thing but still messes with the CPU
>> usage?
>>
>>> That said; I'm starting to regret not pushing the EEVDF thing harder
>>> back in 2010 when I first wrote it :/
>>>
>>>> And also CGroups will be even more weird than it already is in such a world,
>>>> 2 different containers will not get CPU time distributed properly- say if
>>>> tasks in one container sleep a lot and tasks in another container are CPU
>>>> bound.
>>>
>>> Cgroups are an abomination anyway :-) /me runs like hell. But no, I
>>> don't actually expect too much trouble there.
>>
>> So, with 2 equally weighted containers, if one has a task that sleeps
>> 50% of the time, and another has a 100% task, then the sleeper will
>> only run 33% of the time? I can see people running containers having a
>> problem with that (a customer running one container gets less CPU than
>> the other.). Sorry if I missed something.
>>
>
> But the 50% sleeper is _asking_ for less CPU. Doing 50% for each would
> mean that when the sleeper task was awake it always ran, always won, to
> the exclusion of any one else. (Assuming 1 CPU...)
>

It sounds like you are saying that if the task busy-looped instead of
sleeping, it would get more CPU during the time it is not busy-looping
but doing some real work. That sounds like encouraging abuse to get
more perf.

But again, I have not looked too closely at EEVDF or Peter's patches.
I was just going by Vincent's test and was cautioning not to break
users who depend on CFS shares.

Cheers,

- Joel


> Cheers,
> Phil
>
>> But yeah I do find the whole EEVDF idea interesting but I admit I have
>> to research it more.
>>
>> - Joel
>>
>
> --
>