2023-03-28 11:10:12

by Peter Zijlstra

Subject: [PATCH 15/17] [RFC] sched/eevdf: Sleeper bonus

Add a sleeper bonus hack, but keep it default disabled. This should
allow easy testing if regressions are due to this.

Specifically; this 'restores' performance for things like starve and
stress-futex, stress-nanosleep that rely on sleeper bonus to compete
against an always running parent (the fair 67%/33% split vs the
50%/50% bonus thing).

OTOH this completely destroys latency and hackbench (as in 5x worse).

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
---
kernel/sched/fair.c | 47 ++++++++++++++++++++++++++++++++++++++++-------
kernel/sched/features.h | 1 +
kernel/sched/sched.h | 3 ++-
3 files changed, 43 insertions(+), 8 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4819,7 +4819,7 @@ static inline void update_misfit_status(
#endif /* CONFIG_SMP */

static void
-place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
+place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{
u64 vslice = calc_delta_fair(se->slice, se);
u64 vruntime = avg_vruntime(cfs_rq);
@@ -4878,22 +4878,55 @@ place_entity(struct cfs_rq *cfs_rq, stru
if (WARN_ON_ONCE(!load))
load = 1;
lag = div_s64(lag, load);
+
+ vruntime -= lag;
+ }
+
+ /*
+ * Base the deadline on the 'normal' EEVDF placement policy in an
+ * attempt to not let the bonus crud below wreck things completely.
+ */
+ se->deadline = vruntime;
+
+ /*
+ * The whole 'sleeper' bonus hack... :-/ This is strictly unfair.
+ *
+ * By giving a sleeping task a little boost, it becomes possible for a
+ * 50% task to compete equally with a 100% task. That is, strictly fair
+ * that setup would result in a 67% / 33% split. Sleeper bonus will
+ * change that to 50% / 50%.
+ *
+ * This thing hurts my brain, because tasks leaving with negative lag
+ * will move 'time' backward, so comparing against a historical
+ * se->vruntime is dodgy as heck.
+ */
+ if (sched_feat(PLACE_BONUS) &&
+ (flags & ENQUEUE_WAKEUP) && !(flags & ENQUEUE_MIGRATED)) {
+ /*
+ * If se->vruntime is ahead of vruntime, something dodgy
+ * happened and we cannot give bonus due to not having valid
+ * history.
+ */
+ if ((s64)(se->vruntime - vruntime) < 0) {
+ vruntime -= se->slice/2;
+ vruntime = max_vruntime(se->vruntime, vruntime);
+ }
}

- se->vruntime = vruntime - lag;
+ se->vruntime = vruntime;

/*
	 * When joining the competition; the existing tasks will be,
* on average, halfway through their slice, as such start tasks
* off with half a slice to ease into the competition.
*/
- if (sched_feat(PLACE_DEADLINE_INITIAL) && initial)
+ if (sched_feat(PLACE_DEADLINE_INITIAL) && (flags & ENQUEUE_INITIAL))
vslice /= 2;

/*
* EEVDF: vd_i = ve_i + r_i/w_i
*/
- se->deadline = se->vruntime + vslice;
+ se->deadline += vslice;
}

static void check_enqueue_throttle(struct cfs_rq *cfs_rq);
@@ -4910,7 +4943,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, st
* update_curr().
*/
if (curr)
- place_entity(cfs_rq, se, 0);
+ place_entity(cfs_rq, se, flags);

update_curr(cfs_rq);

@@ -4937,7 +4970,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, st
* we can place the entity.
*/
if (!curr)
- place_entity(cfs_rq, se, 0);
+ place_entity(cfs_rq, se, flags);

account_entity_enqueue(cfs_rq, se);

@@ -11933,7 +11966,7 @@ static void task_fork_fair(struct task_s
curr = cfs_rq->curr;
if (curr)
update_curr(cfs_rq);
- place_entity(cfs_rq, se, 1);
+ place_entity(cfs_rq, se, ENQUEUE_INITIAL);
rq_unlock(rq, &rf);
}

--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -7,6 +7,7 @@
SCHED_FEAT(PLACE_LAG, true)
SCHED_FEAT(PLACE_FUDGE, true)
SCHED_FEAT(PLACE_DEADLINE_INITIAL, true)
+SCHED_FEAT(PLACE_BONUS, false)

/*
* Prefer to schedule the task we woke last (assuming it failed
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2143,7 +2143,7 @@ extern const u32 sched_prio_to_wmult[40
* ENQUEUE_HEAD - place at front of runqueue (tail if not specified)
* ENQUEUE_REPLENISH - CBS (replenish runtime and postpone deadline)
* ENQUEUE_MIGRATED - the task was migrated during wakeup
- *
+ * ENQUEUE_INITIAL - place a new task (fork/clone)
*/

#define DEQUEUE_SLEEP 0x01
@@ -2163,6 +2163,7 @@ extern const u32 sched_prio_to_wmult[40
#else
#define ENQUEUE_MIGRATED 0x00
#endif
+#define ENQUEUE_INITIAL 0x80

#define RETRY_TASK ((void *)-1UL)




2023-03-29 09:21:58

by Mike Galbraith

Subject: Re: [PATCH 15/17] [RFC] sched/eevdf: Sleeper bonus

On Tue, 2023-03-28 at 11:26 +0200, Peter Zijlstra wrote:
> Add a sleeper bonus hack, but keep it default disabled. This should
> allow easy testing if regressions are due to this.
>
> Specifically; this 'restores' performance for things like starve and
> stress-futex, stress-nanosleep that rely on sleeper bonus to compete
> against an always running parent (the fair 67%/33% split vs the
> 50%/50% bonus thing).
>
> OTOH this completely destroys latency and hackbench (as in 5x worse).

I profiled that again, but numbers were still.. not so lovely.

Point of this post is the sleeper/hog split business anyway. I've been
running your patches on my desktop box and cute-as-a-button little rpi4b
since they appeared, poking at them looking for any desktop deltas, and
have noticed jack diddly spit.

A lot of benchmarks will notice both distribution and ctx deltas, but
humans.. the numbers I've seen so far say that's highly unlikely.

A couple perf sched lat summaries below for insomniacs.

Load is chrome playing BigBuckBunny (for the zillionth time), which at
the resolution/size I set wants ~35% of the box, vs massive_intr (8ms
run, 1ms sleep, one thread per CPU, profiles at ~91%) as a hog-ish but
not absurdly so competitor.
perf.data.stable.full sort=max - top 10 summary
-----------------------------------------------------------------------------------------------------------
Task | Runtime ms | Switches | Avg delay ms | Max delay ms | Sum delay ms |
-----------------------------------------------------------------------------------------------------------
chrome:(7) | 6274.683 ms | 63604 | avg: 0.172 ms | max: 41.796 ms | sum:10930.150 ms |
massive_intr:(8) |1673597.295 ms | 762617 | avg: 0.709 ms | max: 40.383 ms | sum:540374.853 ms |
X:2476 | 86498.438 ms | 129657 | avg: 0.259 ms | max: 36.157 ms | sum:33588.933 ms |
dav1d-worker:(8) | 162369.504 ms | 411962 | avg: 0.682 ms | max: 30.648 ms | sum:280864.249 ms |
ThreadPoolForeg:(13) | 21177.187 ms | 60907 | avg: 0.401 ms | max: 30.424 ms | sum:24412.770 ms |
gmain:(3) | 95.617 ms | 3552 | avg: 0.755 ms | max: 26.365 ms | sum: 2680.738 ms |
llvmpipe-0:(2) | 24602.666 ms | 30828 | avg: 1.278 ms | max: 23.590 ms | sum:39408.811 ms |
llvmpipe-2:(2) | 27707.699 ms | 29226 | avg: 1.236 ms | max: 23.579 ms | sum:36126.717 ms |
llvmpipe-7:(2) | 34437.755 ms | 27017 | avg: 1.097 ms | max: 23.545 ms | sum:29634.448 ms |
llvmpipe-5:(2) | 24533.947 ms | 28503 | avg: 1.375 ms | max: 22.995 ms | sum:39191.132 ms |
-----------------------------------------------------------------------------------------------------------
TOTAL: |2314609.811 ms | 2473891 | 96.4% util, 27.7% GUI 41.796 ms | 1361629.825 ms |
-----------------------------------------------------------------------------------------------------------

perf.data.eevdf.full sort=max - top 10 summary
-----------------------------------------------------------------------------------------------------------
Task | Runtime ms | Switches | Avg delay ms | Max delay ms | Sum delay ms |
-----------------------------------------------------------------------------------------------------------
chrome:(8) | 6329.996 ms | 80080 | avg: 0.193 ms | max: 28.835 ms | sum:15432.012 ms |
ThreadPoolForeg:(20) | 20477.539 ms | 158457 | avg: 0.265 ms | max: 25.708 ms | sum:42003.063 ms |
dav1d-worker:(8) | 168022.569 ms | 1090719 | avg: 0.366 ms | max: 24.786 ms | sum:398971.023 ms |
massive_intr:(8) |1736052.944 ms | 721103 | avg: 0.658 ms | max: 23.427 ms | sum:474493.391 ms |
llvmpipe-5:(2) | 22970.555 ms | 31184 | avg: 1.448 ms | max: 22.465 ms | sum:45148.667 ms |
llvmpipe-3:(2) | 22803.121 ms | 31688 | avg: 1.436 ms | max: 22.076 ms | sum:45516.196 ms |
llvmpipe-0:(2) | 22050.612 ms | 33580 | avg: 1.397 ms | max: 22.007 ms | sum:46898.028 ms |
VizCompositorTh:5538 | 90856.230 ms | 91865 | avg: 0.605 ms | max: 21.702 ms | sum:55542.418 ms |
llvmpipe-1:(2) | 22866.426 ms | 32870 | avg: 1.390 ms | max: 20.732 ms | sum:45690.066 ms |
llvmpipe-2:(2) | 22672.646 ms | 32319 | avg: 1.415 ms | max: 20.647 ms | sum:45731.838 ms |
-----------------------------------------------------------------------------------------------------------
TOTAL: |2332092.393 ms | 3449563 | 97.1% util, 25.6% GUI 28.835 ms | 1570459.986 ms |
-----------------------------------------------------------------------------------------------------------
vs stable: 1.394x the context switches.. sum-delay distribution delta a meaningless 1.153x