2024-03-25 13:02:52

by K Prateek Nayak

Subject: [RFC PATCH 0/1] sched/eevdf: Curb wakeup preemption further

Problem Statement
=================

Since EEVDF first landed, DeathStarBench [1] has consistently seen a
double-digit drop in performance when all the services run pinned to a
single CCX or across multiple CCXs.

Bisection showed that the regression started at commit 147f3efaa241
("sched/fair: Implement an EEVDF-like scheduling policy") and this was
reported [2]. Narrowing the cause down further within the commit is
hard due to the extent of the changes it contains. EEVDF has seen
extensive development
since, but the regression persists.


Narrowing down the problem
==========================

Commit 147f3efaa241 ("sched/fair: Implement an EEVDF-like scheduling
policy") also added a sched_feat to toggle between EEVDF and legacy CFS.
It was noted that switching to CFS did not completely remedy the
regression. Although some of it could be attributed to the plumbing of
the sched_feat itself, the regression was large enough to prompt an
audit of the path common to CFS and EEVDF added since commit
86bfbb7ce4f6 ("sched/fair: Add lag based placement"). It was discovered
that the clamping of lag in the offending commit was, by itself, enough
to reproduce the regression. Specifically, the following change on top
of commit
86bfbb7ce4f6 ("sched/fair: Add lag based placement"):

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index dd12ada69b12..93b5e606e8e5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -715,13 +715,20 @@ u64 avg_vruntime(struct cfs_rq *cfs_rq)
return cfs_rq->min_vruntime + avg;
}

+static inline u64 calc_delta_fair(u64 delta, struct sched_entity *se);
+
/*
* lag_i = S - s_i = w_i * (V - v_i)
*/
void update_entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
+ s64 lag, limit;
+
SCHED_WARN_ON(!se->on_rq);
- se->vlag = avg_vruntime(cfs_rq) - se->vruntime;
+ lag = avg_vruntime(cfs_rq) - se->vruntime;
+
+ limit = calc_delta_fair(max_t(u64, 2 * sysctl_sched_min_granularity, TICK_NSEC), se);
+ se->vlag = clamp(lag, -limit, limit);
}

static u64 __update_min_vruntime(struct cfs_rq *cfs_rq, u64 vruntime)
--

This turned attention towards the wakeup path, since until then lag was
only used to determine the vruntime of a waking entity in
place_entity().
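
For context, the following is a paraphrased sketch of how place_entity()
consumed the stored lag at commit 86bfbb7ce4f6 (simplified; the
weight-rescaling of the lag done by the actual kernel code is omitted):

  /*
   * Sketch: a waking entity is placed relative to the queue's weighted
   * average vruntime V so that it resumes with roughly the lag it was
   * dequeued with.
   */
  static void place_entity_sketch(struct cfs_rq *cfs_rq, struct sched_entity *se)
  {
          u64 vruntime = avg_vruntime(cfs_rq);    /* V */
          s64 lag = 0;

          if (sched_feat(PLACE_LAG) && cfs_rq->nr_running)
                  lag = se->vlag;                 /* lag saved at dequeue */

          /* v_i = V - lag_i: positive lag places the entity left of V */
          se->vruntime = vruntime - lag;
  }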


sched-scoreboard analysis
=========================

When analyzing the per-task data from the sched-scoreboard [3], it was
noticed that the NGINX server and load balancer accepting the incoming
requests were being preempted very often. Following are the relative
sched-switch numbers for both algorithms:

86bfbb7ce4f6 (CFS) 147f3efaa241 (EEVDF) %diff
Sum of nr_voluntary_switches 1.00 0.02 -98%
Sum of nr_involuntary_switches 1.00 3.20 220%

Most of the other tasks within the chain of services invoked by the
DeathStarBench workflow saw no significant changes in per-task
statistics, except for a drop in the number of wakeups proportional to
the drop in the benchmark throughput for the leaf service.

Looking at pick_eevdf(), called during wakeup preemption, the
eligibility check for the current entity was identified as the reason
for this large number of preemptions.
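
For reference, the eligibility test boils down to a sign check on the
entity's lag (a simplified sketch; the kernel implementation avoids the
division in avg_vruntime() by comparing load-weighted sums directly):

  /*
   * Simplified view: lag_i = w_i * (V - v_i), so an entity is eligible
   * iff its vruntime has not run ahead of the queue's weighted average
   * V, i.e. iff its lag is non-negative.
   */
  static bool entity_eligible_sketch(struct cfs_rq *cfs_rq, struct sched_entity *se)
  {
          return (s64)(avg_vruntime(cfs_rq) - se->vruntime) >= 0;
  }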


Wakeup preemption problem
=========================

With the curr entity's eligibility check, a wakeup preemption is very
likely when an entity with positive lag joins the runqueue, pushing the
avg_vruntime of the runqueue backwards and making the vruntime of the
current entity ineligible. This leads to aggressive wakeup preemption
which was previously guarded against by wakeup_granularity_ns in legacy
CFS. The figure below depicts one such aggressive preemption scenario
with EEVDF:

deadline for Nginx
|
+-------+ | |
/-- | Nginx | -|------------------> |
| +-------+ | |
| |
| -----------|-------------------------------> vruntime timeline
| \--> rq->avg_vruntime
|
| wakes service on the same runqueue since system is busy
|
| +---------+|
\-->| Service || (service has +ve lag pushes avg_vruntime backwards)
+---------+|
| |
wakeup | +--|-----+ |
preempts \---->| N|ginx | -----------> | {deadline for Nginx}
+--|-----+ |
(Nginx ineligible)
-----------|-------------------------------> vruntime timeline
\--> rq->avg_vruntime

When the NGINX server is involuntarily switched out, it cannot accept
any incoming requests, leading to longer turnaround times for the
clients and thus a loss in DeathStarBench throughput.
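
A toy calculation (hypothetical numbers, equal task weights, ignoring
the load-weighting the kernel applies) illustrates how a single
positive-lag wakeup can flip the current task ineligible:

  #include <stdio.h>

  int main(void)
  {
          /* hypothetical vruntimes in arbitrary units */
          double v_nginx = 100.0;         /* curr; running alone, V == 100 */
          double vlag_service = 10.0;     /* service saved positive lag */
          double v_service = 100.0 - vlag_service; /* placed at V - lag = 90 */

          /* new average after the wakeup: (100 + 90) / 2 = 95 */
          double v_avg = (v_nginx + v_service) / 2.0;

          /* eligible iff v_i <= V; 100 <= 95 fails, so Nginx is preempted */
          printf("new avg_vruntime = %.1f, nginx eligible: %s\n",
                 v_avg, v_nginx <= v_avg ? "yes" : "no");
          return 0;
  }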

==================================================================
Test : DeathStarBench
Units : Normalized latency
Interpretation: Lower is better
Statistic : Mean
==================================================================
tip 1.00
eevdf 1.14 (+14.61%)


Proposed solution
=================

For the current running task, skip the eligibility check in
pick_eevdf() if it has not yet exhausted the slice promised to it at
selection, even though the situation may have changed since. With the
proposed solution, the performance loss seen since the merge of EEVDF
goes away. Following are the results from testing on a Dual Socket
3rd Generation EPYC server (2 x 64C/128T):

==================================================================
Test : DeathStarBench
Units : Normalized throughput
Interpretation: Higher is better
Statistic : Mean
==================================================================
Pinning scaling tip run-to-parity-wakeup(pct imp)
1CCD 1 1.00 1.16 (%diff: 16%)
2CCD 2 1.00 1.03 (%diff: 3%)
4CCD 4 1.00 1.12 (%diff: 12%)
8CCD 8 1.00 1.05 (%diff: 6%)

With spec_rstack_overflow=off, the DeathStarBench performance with the
proposed solution is the same as the performance on the v6.5 release
before EEVDF was merged.
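
For reference, the "slice not yet exhausted" condition in the patch
reuses an existing marker rather than tracking the elapsed time
explicitly (a sketch of the idea; see the HACK comment around
set_next_entity() in kernel/sched/fair.c):

  /*
   * set_next_entity() stashes se->deadline in se->vlag when the entity
   * is picked; update_curr() assigns a new deadline once the slice is
   * consumed, breaking the equality. Hence:
   */
  bool slice_unexhausted = (curr->vlag == curr->deadline);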


Implication
===========

This may lead to a newly woken task waiting longer for its turn on the
CPU. However, testing on a 3rd Generation Dual Socket EPYC system
(2 x 64C/128T) did not reveal any consistent regressions with the
standard benchmarks:

==================================================================
Test : hackbench
Units : Normalized time in seconds
Interpretation: Lower is better
Statistic : AMean
==================================================================
Case: tip[pct imp](CV) run-to-parity-wakeup[pct imp](CV)
1-groups 1.00 [ -0.00]( 2.08) 0.98 [ 2.29]( 3.19)
2-groups 1.00 [ -0.00]( 0.89) 1.02 [ -2.22]( 1.27)
4-groups 1.00 [ -0.00]( 0.81) 1.00 [ 0.30]( 0.97)
8-groups 1.00 [ -0.00]( 0.78) 0.97 [ 2.62]( 0.80)
16-groups 1.00 [ -0.00]( 1.60) 0.96 [ 3.87]( 2.57)

==================================================================
Test : tbench
Units : Normalized throughput
Interpretation: Higher is better
Statistic : AMean
==================================================================
Clients: tip[pct imp](CV) run-to-parity-wakeup[pct imp](CV)
1 1.00 [ 0.00]( 0.71) 1.01 [ 1.13]( 0.20)
2 1.00 [ 0.00]( 0.25) 1.00 [ -0.42]( 0.32)
4 1.00 [ 0.00]( 0.85) 1.00 [ 0.07]( 0.41)
8 1.00 [ 0.00]( 1.00) 0.99 [ -1.02]( 0.57)
16 1.00 [ 0.00]( 1.25) 0.99 [ -1.13]( 1.90)
32 1.00 [ 0.00]( 0.35) 0.98 [ -1.90]( 2.58)
64 1.00 [ 0.00]( 0.71) 0.98 [ -2.20]( 1.33)
128 1.00 [ 0.00]( 0.46) 0.98 [ -1.94]( 0.42)
256 1.00 [ 0.00]( 0.24) 0.98 [ -2.19]( 0.24)
512 1.00 [ 0.00]( 0.30) 0.99 [ -0.87]( 0.15)
1024 1.00 [ 0.00]( 0.40) 1.00 [ -0.29]( 0.75)

==================================================================
Test : stream-10
Units : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic : HMean
==================================================================
Test: tip[pct imp](CV) run-to-parity-wakeup[pct imp](CV)
Copy 1.00 [ 0.00]( 9.73) 1.01 [ 0.53]( 6.33)
Scale 1.00 [ 0.00]( 5.57) 1.00 [ -0.17]( 3.89)
Add 1.00 [ 0.00]( 5.43) 1.00 [ 0.48]( 3.63)
Triad 1.00 [ 0.00]( 5.50) 0.98 [ -1.90]( 6.31)

==================================================================
Test : stream-100
Units : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic : HMean
==================================================================
Test: tip[pct imp](CV) run-to-parity-wakeup[pct imp](CV)
Copy 1.00 [ 0.00]( 3.26) 0.99 [ -0.78]( 2.94)
Scale 1.00 [ 0.00]( 1.26) 0.97 [ -2.86]( 6.48)
Add 1.00 [ 0.00]( 1.47) 0.99 [ -1.34]( 4.20)
Triad 1.00 [ 0.00]( 1.77) 1.01 [ 0.53]( 2.23)

==================================================================
Test : netperf
Units : Normalized Throughput
Interpretation: Higher is better
Statistic : AMean
==================================================================
Clients: tip[pct imp](CV) run-to-parity-wakeup[pct imp](CV)
1-clients 1.00 [ 0.00]( 0.22) 1.02 [ 2.26]( 1.08)
2-clients 1.00 [ 0.00]( 0.57) 1.02 [ 2.02]( 1.03)
4-clients 1.00 [ 0.00]( 0.43) 1.02 [ 2.35]( 0.64)
8-clients 1.00 [ 0.00]( 0.27) 1.02 [ 2.32]( 0.72)
16-clients 1.00 [ 0.00]( 0.46) 1.02 [ 1.55]( 0.73)
32-clients 1.00 [ 0.00]( 0.95) 1.00 [ -0.46]( 1.47)
64-clients 1.00 [ 0.00]( 1.79) 0.97 [ -2.72]( 1.34)
128-clients 1.00 [ 0.00]( 0.89) 1.00 [ 0.25]( 0.89)
256-clients 1.00 [ 0.00]( 8.68) 0.95 [ -4.96]( 7.11) *
512-clients 1.00 [ 0.00](35.06) 1.00 [ -0.43](51.55)

==================================================================
Test : schbench
Units : Normalized 99th percentile latency in us
Interpretation: Lower is better
Statistic : Median
==================================================================
#workers: tip[pct imp](CV) run-to-parity-wakeup[pct imp](CV)
1 1.00 [ -0.00](27.28) 0.69 [ 31.25](41.89)
2 1.00 [ -0.00]( 3.85) 1.00 [ -0.00](11.47)
4 1.00 [ -0.00](14.00) 1.21 [-21.05]( 7.20) *
8 1.00 [ -0.00]( 4.68) 1.04 [ -4.17](13.10) *
16 1.00 [ -0.00]( 4.08) 0.98 [ 1.61]( 1.64)
32 1.00 [ -0.00]( 6.68) 0.94 [ 6.12]( 9.27)
64 1.00 [ -0.00]( 1.79) 0.98 [ 1.53]( 2.36)
128 1.00 [ -0.00]( 6.30) 1.03 [ -3.16]( 8.92) *
256 1.00 [ -0.00](43.39) 1.36 [-35.94](12.43) *
512 1.00 [ -0.00]( 2.26) 0.78 [ 22.04]( 1.77)

==================================================================
Test : SPECjbb (Without any scheduler tunable fiddling)
Units : Normalized throughput
Interpretation: Higher is better
Statistic : Mean
==================================================================
Metric tip run-to-parity-wakeup
Max-jOPS 1.00 1.01 (+1.21%)
Critical-jOPS 1.00 1.01 (+1.05%)

* Points of large regression also show large run-to-run variance on both
tip:sched/core and with the proposed solution.


Future work and alternatives
============================

If wakeup latency is indeed a concern, this solution can be paired with
the "sched/eevdf: Allow shorter slices to wakeup-preempt" patch from
Peter's EEVDF tree [4] to allow latency-sensitive tasks to preempt the
current running task, with the "PREEMPT_SHORT" feat superseding the
"RUN_TO_PARITY_WAKEUP" decision for such tasks.

If not "RUN_TO_PARITY_WAKEUP" a tunable, similar to
"wakeup_granularity_ns" might need to be re-introduced to curb such
scenarios with EEVDF.
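
For comparison, legacy CFS implemented that buffer roughly as follows (a
paraphrased sketch of wakeup_preempt_entity() from pre-EEVDF
kernel/sched/fair.c):

  /*
   * Sketch: preempt only if curr's vruntime is ahead of the waking
   * entity's by more than the weight-scaled wakeup granularity.
   */
  static int wakeup_preempt_entity_sketch(struct sched_entity *curr,
                                          struct sched_entity *se)
  {
          s64 gran, vdiff = curr->vruntime - se->vruntime;

          if (vdiff <= 0)
                  return -1;      /* curr is behind the wakee: no preemption */

          gran = wakeup_gran(se); /* ~sysctl_sched_wakeup_granularity, scaled */
          if (vdiff > gran)
                  return 1;       /* curr is far enough ahead: preempt */

          return 0;
  }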

Tobias seems to be dealing with the other side of the coin, where a
newly woken task is not being scheduled in fast enough on the CPU [5].
It'll
be good to discuss how to keep both sides happy, or discover a middle
ground where both are less unhappy :)


References
==========

[1] https://github.com/delimitrou/DeathStarBench/
[2] https://lore.kernel.org/all/[email protected]/
[3] https://github.com/AMDESE/sched-scoreboard/
[4] https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/commit/?h=sched/eevdf&id=c2594f33fb1e1f4a2a4d962c464cb63a27a8f3d9 (Prone to force-update)
[5] https://lore.kernel.org/lkml/[email protected]/

The patch applies cleanly on tip:sched/core at commit b9e6e2866392
("sched/fair: Fix typos in comments")
---
K Prateek Nayak (1):
sched/eevdf: Skip eligibility check for current entity during wakeup
preemption

kernel/sched/fair.c | 24 ++++++++++++++++++++----
kernel/sched/features.h | 1 +
2 files changed, 21 insertions(+), 4 deletions(-)

--
2.34.1



2024-03-25 13:02:56

by K Prateek Nayak

Subject: [RFC PATCH 1/1] sched/eevdf: Skip eligibility check for current entity during wakeup preemption

With the curr entity's eligibility check, a wakeup preemption is very
likely when an entity with positive lag joins the runqueue, pushing the
avg_vruntime of the runqueue backwards and making the vruntime of the
current entity ineligible. This leads to aggressive wakeup preemption
which was previously guarded against by wakeup_granularity_ns in legacy
CFS. The figure below depicts one such aggressive preemption scenario
with EEVDF in DeathStarBench [1]:

deadline for Nginx
|
+-------+ | |
/-- | Nginx | -|------------------> |
| +-------+ | |
| |
| -----------|-------------------------------> vruntime timeline
| \--> rq->avg_vruntime
|
| wakes service on the same runqueue since system is busy
|
| +---------+|
\-->| Service || (service has +ve lag pushes avg_vruntime backwards)
+---------+|
| |
wakeup | +--|-----+ |
preempts \---->| N|ginx | --------------------> | {deadline for Nginx}
+--|-----+ |
(Nginx ineligible)
-----------|-------------------------------> vruntime timeline
\--> rq->avg_vruntime

When the NGINX server is involuntarily switched out, it cannot accept
any incoming requests, leading to longer turnaround times for the
clients and thus a loss in DeathStarBench throughput.

==================================================================
Test : DeathStarBench
Units : Normalized latency
Interpretation: Lower is better
Statistic : Mean
==================================================================
tip 1.00
eevdf 1.14 (+14.61%)

For the current running task, skip the eligibility check in
pick_eevdf() if it has not yet exhausted the slice promised to it at
selection, even though the situation may have changed since. The
behavior is guarded by the RUN_TO_PARITY_WAKEUP sched_feat to simplify
testing. With RUN_TO_PARITY_WAKEUP enabled, the performance loss seen
with DeathStarBench since the merge of EEVDF disappears. Following are
the results from testing on a Dual Socket 3rd Generation EPYC server
(2 x 64C/128T):

==================================================================
Test : DeathStarBench
Units : Normalized throughput
Interpretation: Higher is better
Statistic : Mean
==================================================================
Pinning scaling tip run-to-parity-wakeup(pct imp)
1CCD 1 1.00 1.16 (%diff: 16%)
2CCD 2 1.00 1.03 (%diff: 3%)
4CCD 4 1.00 1.12 (%diff: 12%)
8CCD 8 1.00 1.05 (%diff: 6%)

With spec_rstack_overflow=off, the DeathStarBench performance with the
proposed solution is the same as the performance on the v6.5 release
before EEVDF was merged.

This may lead to a newly woken task waiting longer for its turn on the
CPU; however, testing on the same system did not reveal any consistent
regressions with the standard benchmarks.

Link: https://github.com/delimitrou/DeathStarBench/ [1]
Signed-off-by: K Prateek Nayak <[email protected]>
---
kernel/sched/fair.c | 24 ++++++++++++++++++++----
kernel/sched/features.h | 1 +
2 files changed, 21 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6a16129f9a5c..a9b145a4eab0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -875,7 +875,7 @@ struct sched_entity *__pick_first_entity(struct cfs_rq *cfs_rq)
*
* Which allows tree pruning through eligibility.
*/
-static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
+static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq, bool wakeup_preempt)
{
struct rb_node *node = cfs_rq->tasks_timeline.rb_root.rb_node;
struct sched_entity *se = __pick_first_entity(cfs_rq);
@@ -889,7 +889,23 @@ static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
if (cfs_rq->nr_running == 1)
return curr && curr->on_rq ? curr : se;

- if (curr && (!curr->on_rq || !entity_eligible(cfs_rq, curr)))
+ if (curr && !curr->on_rq)
+ curr = NULL;
+
+ /*
+ * When an entity with positive lag wakes up, it pushes the
+ * avg_vruntime of the runqueue backwards. This may cause the
+ * current entity to become ineligible soon into its run, leading
+ * to wakeup preemption.
+ *
+ * To prevent such aggressive preemption of the current running
+ * entity during task wakeups, skip the eligibility check if the
+ * slice promised to the entity at its selection has not yet
+ * elapsed.
+ */
+ if (curr &&
+ !(sched_feat(RUN_TO_PARITY_WAKEUP) && wakeup_preempt && curr->vlag == curr->deadline) &&
+ !entity_eligible(cfs_rq, curr))
curr = NULL;

/*
@@ -5460,7 +5476,7 @@ pick_next_entity(struct cfs_rq *cfs_rq)
cfs_rq->next && entity_eligible(cfs_rq, cfs_rq->next))
return cfs_rq->next;

- return pick_eevdf(cfs_rq);
+ return pick_eevdf(cfs_rq, false);
}

static bool check_cfs_rq_runtime(struct cfs_rq *cfs_rq);
@@ -8340,7 +8356,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
/*
* XXX pick_eevdf(cfs_rq) != se ?
*/
- if (pick_eevdf(cfs_rq) == pse)
+ if (pick_eevdf(cfs_rq, true) == pse)
goto preempt;

return;
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 143f55df890b..027bab5b4031 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -7,6 +7,7 @@
SCHED_FEAT(PLACE_LAG, true)
SCHED_FEAT(PLACE_DEADLINE_INITIAL, true)
SCHED_FEAT(RUN_TO_PARITY, true)
+SCHED_FEAT(RUN_TO_PARITY_WAKEUP, true)

/*
* Prefer to schedule the task we woke last (assuming it failed
--
2.34.1


2024-03-25 16:53:38

by Youssef Esmat

Subject: Re: [RFC PATCH 1/1] sched/eevdf: Skip eligibility check for current entity during wakeup preemption

On Mon, Mar 25, 2024 at 1:03 AM K Prateek Nayak <kprateek.nayak@amd.com> wrote:
>
> With the curr entity's eligibility check, a wakeup preemption is very
> likely when an entity with positive lag joins the runqueue, pushing the
> avg_vruntime of the runqueue backwards and making the vruntime of the
> current entity ineligible. This leads to aggressive wakeup preemption
> which was previously guarded against by wakeup_granularity_ns in legacy
> CFS. The figure below depicts one such aggressive preemption scenario
> with EEVDF in DeathStarBench [1]:
>
> deadline for Nginx
> |
> +-------+ | |
> /-- | Nginx | -|------------------> |
> | +-------+ | |
> | |
> | -----------|-------------------------------> vruntime timeline
> | \--> rq->avg_vruntime
> |
> | wakes service on the same runqueue since system is busy
> |
> | +---------+|
> \-->| Service || (service has +ve lag pushes avg_vruntime backwards)
> +---------+|
> | |
> wakeup | +--|-----+ |
> preempts \---->| N|ginx | --------------------> | {deadline for Nginx}
> +--|-----+ |
> (Nginx ineligible)
> -----------|-------------------------------> vruntime timeline
> \--> rq->avg_vruntime
>
> When the NGINX server is involuntarily switched out, it cannot accept
> any incoming requests, leading to longer turnaround times for the
> clients and thus a loss in DeathStarBench throughput.
>
> ==================================================================
> Test : DeathStarBench
> Units : Normalized latency
> Interpretation: Lower is better
> Statistic : Mean
> ==================================================================
> tip 1.00
> eevdf 1.14 (+14.61%)
>
> For the current running task, skip the eligibility check in
> pick_eevdf() if it has not yet exhausted the slice promised to it at
> selection, even though the situation may have changed since. The
> behavior is guarded by the RUN_TO_PARITY_WAKEUP sched_feat to simplify
> testing. With RUN_TO_PARITY_WAKEUP enabled, the performance loss seen
> with DeathStarBench since the merge of EEVDF disappears. Following are
> the results from testing on a Dual Socket 3rd Generation EPYC server
> (2 x 64C/128T):
>
> ==================================================================
> Test : DeathStarBench
> Units : Normalized throughput
> Interpretation: Higher is better
> Statistic : Mean
> ==================================================================
> Pinning scaling tip run-to-parity-wakeup(pct imp)
> 1CCD 1 1.00 1.16 (%diff: 16%)
> 2CCD 2 1.00 1.03 (%diff: 3%)
> 4CCD 4 1.00 1.12 (%diff: 12%)
> 8CCD 8 1.00 1.05 (%diff: 6%)
>
> With spec_rstack_overflow=off, the DeathStarBench performance with the
> proposed solution is the same as the performance on the v6.5 release
> before EEVDF was merged.

Thanks for sharing this Prateek.
We actually noticed we could also gain performance by disabling
eligibility checks (but disabling them on all paths).
The following are a few threads we had on the topic:

Discussion around eligibility:
https://lore.kernel.org/lkml/CA+q576MS0-MV1Oy-eecvmYpvNT3tqxD8syzrpxQ-Zk310hvRbw@mail.gmail.com/
Some of our results:
https://lore.kernel.org/lkml/CA+q576Mov1jpdfZhPBoy_hiVh3xSWuJjXdP3nS4zfpqfOXtq7Q@mail.gmail.com/
Sched feature to disable eligibility:
https://lore.kernel.org/lkml/20231013030213.2472697-1-youssefesmat@chromium.org/

>
> This may lead to a newly woken task waiting longer for its turn on the
> CPU; however, testing on the same system did not reveal any consistent
> regressions with the standard benchmarks.
>
> Link: https://github.com/delimitrou/DeathStarBench/ [1]
> Signed-off-by: K Prateek Nayak <[email protected]>
> ---
> kernel/sched/fair.c | 24 ++++++++++++++++++++----
> kernel/sched/features.h | 1 +
> 2 files changed, 21 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 6a16129f9a5c..a9b145a4eab0 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -875,7 +875,7 @@ struct sched_entity *__pick_first_entity(struct cfs_rq *cfs_rq)
> *
> * Which allows tree pruning through eligibility.
> */
> -static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
> +static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq, bool wakeup_preempt)
> {
> struct rb_node *node = cfs_rq->tasks_timeline.rb_root.rb_node;
> struct sched_entity *se = __pick_first_entity(cfs_rq);
> @@ -889,7 +889,23 @@ static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
> if (cfs_rq->nr_running == 1)
> return curr && curr->on_rq ? curr : se;
>
> - if (curr && (!curr->on_rq || !entity_eligible(cfs_rq, curr)))
> + if (curr && !curr->on_rq)
> + curr = NULL;
> +
> + /*
> + * When an entity with positive lag wakes up, it pushes the
> + * avg_vruntime of the runqueue backwards. This may cause the
> + * current entity to become ineligible soon into its run, leading
> + * to wakeup preemption.
> + *
> + * To prevent such aggressive preemption of the current running
> + * entity during task wakeups, skip the eligibility check if the
> + * slice promised to the entity at its selection has not yet
> + * elapsed.
> + */
> + if (curr &&
> + !(sched_feat(RUN_TO_PARITY_WAKEUP) && wakeup_preempt && curr->vlag == curr->deadline) &&
> + !entity_eligible(cfs_rq, curr))
> curr = NULL;
>
> /*
> @@ -5460,7 +5476,7 @@ pick_next_entity(struct cfs_rq *cfs_rq)
> cfs_rq->next && entity_eligible(cfs_rq, cfs_rq->next))
> return cfs_rq->next;
>
> - return pick_eevdf(cfs_rq);
> + return pick_eevdf(cfs_rq, false);
> }
>
> static bool check_cfs_rq_runtime(struct cfs_rq *cfs_rq);
> @@ -8340,7 +8356,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
> /*
> * XXX pick_eevdf(cfs_rq) != se ?
> */
> - if (pick_eevdf(cfs_rq) == pse)
> + if (pick_eevdf(cfs_rq, true) == pse)
> goto preempt;
>
> return;
> diff --git a/kernel/sched/features.h b/kernel/sched/features.h
> index 143f55df890b..027bab5b4031 100644
> --- a/kernel/sched/features.h
> +++ b/kernel/sched/features.h
> @@ -7,6 +7,7 @@
> SCHED_FEAT(PLACE_LAG, true)
> SCHED_FEAT(PLACE_DEADLINE_INITIAL, true)
> SCHED_FEAT(RUN_TO_PARITY, true)
> +SCHED_FEAT(RUN_TO_PARITY_WAKEUP, true)
>
> /*
> * Prefer to schedule the task we woke last (assuming it failed
> --
> 2.34.1
>

2024-03-26 03:06:45

by K Prateek Nayak

Subject: Re: [RFC PATCH 1/1] sched/eevdf: Skip eligibility check for current entity during wakeup preemption

Hello Youssef,

On 3/25/2024 8:43 PM, Youssef Esmat wrote:
> On Mon, Mar 25, 2024 at 1:03 AM K Prateek Nayak <[email protected]> wrote:
>>
>> With the curr entity's eligibility check, a wakeup preemption is very
>> likely when an entity with positive lag joins the runqueue, pushing the
>> avg_vruntime of the runqueue backwards and making the vruntime of the
>> current entity ineligible. This leads to aggressive wakeup preemption
>> which was previously guarded against by wakeup_granularity_ns in legacy
>> CFS. The figure below depicts one such aggressive preemption scenario
>> with EEVDF in DeathStarBench [1]:
>>
>> deadline for Nginx
>> |
>> +-------+ | |
>> /-- | Nginx | -|------------------> |
>> | +-------+ | |
>> | |
>> | -----------|-------------------------------> vruntime timeline
>> | \--> rq->avg_vruntime
>> |
>> | wakes service on the same runqueue since system is busy
>> |
>> | +---------+|
>> \-->| Service || (service has +ve lag pushes avg_vruntime backwards)
>> +---------+|
>> | |
>> wakeup | +--|-----+ |
>> preempts \---->| N|ginx | --------------------> | {deadline for Nginx}
>> +--|-----+ |
>> (Nginx ineligible)
>> -----------|-------------------------------> vruntime timeline
>> \--> rq->avg_vruntime
>>
>> When the NGINX server is involuntarily switched out, it cannot accept
>> any incoming requests, leading to longer turnaround times for the
>> clients and thus a loss in DeathStarBench throughput.
>>
>> ==================================================================
>> Test : DeathStarBench
>> Units : Normalized latency
>> Interpretation: Lower is better
>> Statistic : Mean
>> ==================================================================
>> tip 1.00
>> eevdf 1.14 (+14.61%)
>>
>> For the current running task, skip the eligibility check in
>> pick_eevdf() if it has not yet exhausted the slice promised to it at
>> selection, even though the situation may have changed since. The
>> behavior is guarded by the RUN_TO_PARITY_WAKEUP sched_feat to simplify
>> testing. With RUN_TO_PARITY_WAKEUP enabled, the performance loss seen
>> with DeathStarBench since the merge of EEVDF disappears. Following are
>> the results from testing on a Dual Socket 3rd Generation EPYC server
>> (2 x 64C/128T):
>>
>> ==================================================================
>> Test : DeathStarBench
>> Units : Normalized throughput
>> Interpretation: Higher is better
>> Statistic : Mean
>> ==================================================================
>> Pinning scaling tip run-to-parity-wakeup(pct imp)
>> 1CCD 1 1.00 1.16 (%diff: 16%)
>> 2CCD 2 1.00 1.03 (%diff: 3%)
>> 4CCD 4 1.00 1.12 (%diff: 12%)
>> 8CCD 8 1.00 1.05 (%diff: 6%)
>>
>> With spec_rstack_overflow=off, the DeathStarBench performance with the
>> proposed solution is the same as the performance on the v6.5 release
>> before EEVDF was merged.
>
> Thanks for sharing this Prateek.
> We actually noticed we could also gain performance by disabling
> eligibility checks (but disable it on all paths).
> The following are a few threads we had on the topic:
>
> Discussion around eligibility:
> https://lore.kernel.org/lkml/CA+q576MS0-MV1Oy-eecvmYpvNT3tqxD8syzrpxQ-Zk310hvRbw@mail.gmail.com/
> Some of our results:
> https://lore.kernel.org/lkml/CA+q576Mov1jpdfZhPBoy_hiVh3xSWuJjXdP3nS4zfpqfOXtq7Q@mail.gmail.com/
> Sched feature to disable eligibility:
> https://lore.kernel.org/lkml/[email protected]/
>

Thank you for pointing me to the discussions. I'll give this a spin on
my machine and report back what I see. Hope some of it will help during
the OSPM discussion :)

>>
>> This may lead to a newly woken task waiting longer for its turn on the
>> CPU; however, testing on the same system did not reveal any consistent
>> regressions with the standard benchmarks.
>>
>> Link: https://github.com/delimitrou/DeathStarBench/ [1]
>> Signed-off-by: K Prateek Nayak <[email protected]>
>> ---
>> [..snip..]
>>

--
Thanks and Regards,
Prateek

2024-03-28 10:32:14

by Madadi Vineeth Reddy

Subject: Re: [RFC PATCH 0/1] sched/eevdf: Curb wakeup preemption further

Hi Prateek,

On 25/03/24 11:32, K Prateek Nayak wrote:
> Wakeup preemption problem
> =========================
>
> With the curr entity's eligibility check, a wakeup preemption is very
> likely when an entity with positive lag joins the runqueue, pushing the
> avg_vruntime of the runqueue backwards and making the vruntime of the
> current entity ineligible. This leads to aggressive wakeup preemption
> which was previously guarded against by wakeup_granularity_ns in legacy
> CFS.

Is base_slice_ns not guarding it in EEVDF?

> The figure below depicts one such aggressive preemption scenario with EEVDF:

Thanks and Regards
Madadi Vineeth Reddy

2024-03-29 03:17:16

by K Prateek Nayak

Subject: Re: [RFC PATCH 0/1] sched/eevdf: Curb wakeup preemption further

On 3/28/2024 3:56 PM, Madadi Vineeth Reddy wrote:
> Hi Prateek,
>
> On 25/03/24 11:32, K Prateek Nayak wrote:
>> Wakeup preemption problem
>> =========================
>>
>> With the curr entity's eligibility check, a wakeup preemption is very
>> likely when an entity with positive lag joins the runqueue, pushing the
>> avg_vruntime of the runqueue backwards and making the vruntime of the
>> current entity ineligible. This leads to aggressive wakeup preemption
>> which was previously guarded against by wakeup_granularity_ns in legacy
>> CFS.
>
> Is base_slice_ns not guarding it in EEVDF?

"base_slice_ns" guards it in some sense since lag clamping is based on
the entity's slice but the problem is the eligibility criteria which is
purely based on rq's avg_vruntime and entity's vruntime. Previously
"wakeup_granularity_ns" would have added a buffer before preempting the
current task soon into its run but this is no longer the case with EEVDF
especially if new entities join the runqueue with positive lag pulling
the avg_vruntime backwards.
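
Put differently, here is an illustrative sketch of the two mechanisms
(constants and weight scaling elided):

  /*
   * base_slice_ns only bounds how much lag an entity may carry across
   * a dequeue/enqueue cycle (see update_entity_lag()):
   */
  limit = calc_delta_fair(max_t(u64, 2 * se->slice, TICK_NSEC), se);
  se->vlag = clamp(lag, -limit, limit);

  /*
   * ...while eligibility remains a pure sign test, so even a small
   * positive-lag placement that drags avg_vruntime behind curr's
   * vruntime makes curr ineligible immediately, with no
   * wakeup_granularity_ns-style buffer:
   */
  eligible = avg_vruntime(cfs_rq) - curr->vruntime >= 0;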

Please do correct me if I've missed something.

>
>> The figure below depicts one such aggressive preemption scenario with EEVDF:
>
> Thanks and Regards
> Madadi Vineeth Reddy

--
Thanks and Regards,
Prateek

2024-04-17 06:08:47

by K Prateek Nayak

Subject: Re: [RFC PATCH 1/1] sched/eevdf: Skip eligibility check for current entity during wakeup preemption

Hello Youssef,

On 3/26/2024 8:36 AM, K Prateek Nayak wrote:
>> [..snip..]
>>
>> Thanks for sharing this Prateek.
>> We actually noticed we could also gain performance by disabling
>> eligibility checks (but disable it on all paths).
>> The following are a few threads we had on the topic:
>>
>> Discussion around eligibility:
>> https://lore.kernel.org/lkml/CA+q576MS0-MV1Oy-eecvmYpvNT3tqxD8syzrpxQ-Zk310hvRbw@mail.gmail.com/
>> Some of our results:
>> https://lore.kernel.org/lkml/CA+q576Mov1jpdfZhPBoy_hiVh3xSWuJjXdP3nS4zfpqfOXtq7Q@mail.gmail.com/
>> Sched feature to disable eligibility:
>> https://lore.kernel.org/lkml/[email protected]/
>>
>
> Thank you for pointing me to the discussions. I'll give this a spin on
> my machine and report back what I see. Hope some of it will help during
> the OSPM discussion :)

Sorry about the delay, but on a positive note, I do not see any
concerning regressions after dropping the eligibility criterion. I'll
leave the full results from my testing below.

o System Details

- 3rd Generation EPYC System
- 2 x 64C/128T
- NPS1 mode

o Kernels

tip: tip:sched/core at commit 4475cd8bfd9b
("sched/balancing: Simplify the sg_status
bitmask and use separate ->overloaded and
->overutilized flags")

eie: (everyone is eligible)
tip + vruntime_eligible() and entity_eligible()
always return true.

o Results

==================================================================
Test : hackbench
Units : Normalized time in seconds
Interpretation: Lower is better
Statistic : AMean
==================================================================
Case: tip[pct imp](CV) eie[pct imp](CV)
1-groups 1.00 [ -0.00]( 1.94) 0.95 [ 5.11]( 2.56)
2-groups 1.00 [ -0.00]( 2.41) 0.97 [ 2.80]( 1.52)
4-groups 1.00 [ -0.00]( 1.16) 0.95 [ 5.01]( 1.04)
8-groups 1.00 [ -0.00]( 1.72) 0.96 [ 4.37]( 1.01)
16-groups 1.00 [ -0.00]( 2.16) 0.94 [ 5.88]( 2.30)


==================================================================
Test : tbench
Units : Normalized throughput
Interpretation: Higher is better
Statistic : AMean
==================================================================
Clients: tip[pct imp](CV) eie[pct imp](CV)
1 1.00 [ 0.00]( 0.69) 1.00 [ 0.05]( 0.61)
2 1.00 [ 0.00]( 0.25) 1.00 [ 0.06]( 0.51)
4 1.00 [ 0.00]( 1.04) 0.98 [ -1.69]( 1.21)
8 1.00 [ 0.00]( 0.72) 1.00 [ -0.13]( 0.56)
16 1.00 [ 0.00]( 2.40) 1.00 [ 0.43]( 0.63)
32 1.00 [ 0.00]( 0.62) 0.98 [ -1.80]( 2.18)
64 1.00 [ 0.00]( 1.19) 0.98 [ -2.13]( 1.26)
128 1.00 [ 0.00]( 0.91) 1.00 [ 0.37]( 0.50)
256 1.00 [ 0.00]( 0.52) 1.00 [ -0.11]( 0.21)
512 1.00 [ 0.00]( 0.36) 1.02 [ 1.54]( 0.58)
1024 1.00 [ 0.00]( 0.26) 1.01 [ 1.21]( 0.41)


==================================================================
Test : stream-10
Units : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic : HMean
==================================================================
Test: tip[pct imp](CV) eie[pct imp](CV)
Copy 1.00 [ 0.00]( 5.01) 1.01 [ 1.27]( 4.63)
Scale 1.00 [ 0.00]( 6.93) 1.03 [ 2.66]( 5.20)
Add 1.00 [ 0.00]( 5.94) 1.03 [ 3.41]( 4.99)
Triad 1.00 [ 0.00]( 6.40) 0.95 [ -4.69]( 8.29)


==================================================================
Test : stream-100
Units : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic : HMean
==================================================================
Test: tip[pct imp](CV) eie[pct imp](CV)
Copy 1.00 [ 0.00]( 2.84) 1.00 [ -0.37]( 2.44)
Scale 1.00 [ 0.00]( 5.26) 1.00 [ 0.21]( 3.88)
Add 1.00 [ 0.00]( 4.98) 1.00 [ 0.11]( 1.15)
Triad 1.00 [ 0.00]( 1.60) 0.96 [ -3.72]( 5.26)


==================================================================
Test : netperf
Units : Normalized Throughput
Interpretation: Higher is better
Statistic : AMean
==================================================================
Clients: tip[pct imp](CV) eie[pct imp](CV)
1-clients 1.00 [ 0.00]( 0.90) 1.00 [ -0.09]( 0.16)
2-clients 1.00 [ 0.00]( 0.77) 0.99 [ -0.89]( 0.97)
4-clients 1.00 [ 0.00]( 0.63) 0.99 [ -1.03]( 1.53)
8-clients 1.00 [ 0.00]( 0.52) 0.99 [ -0.86]( 1.66)
16-clients 1.00 [ 0.00]( 0.43) 0.99 [ -0.91]( 0.79)
32-clients 1.00 [ 0.00]( 0.88) 0.98 [ -2.37]( 1.42)
64-clients 1.00 [ 0.00]( 1.63) 0.96 [ -4.07]( 0.91) *
128-clients 1.00 [ 0.00]( 0.94) 1.00 [ -0.30]( 0.94)
256-clients 1.00 [ 0.00]( 5.08) 0.95 [ -4.95]( 3.36)
512-clients 1.00 [ 0.00](51.89) 0.99 [ -0.93](51.00)

* This seems to be the only point of regression with low CV. I'll
rerun this and report back if I see a consistent dip, but for the
time being I'm not worried.


==================================================================
Test : schbench
Units : Normalized 99th percentile latency in us
Interpretation: Lower is better
Statistic : Median
==================================================================
#workers: tip[pct imp](CV) eie[pct imp](CV)
1 1.00 [ -0.00](30.01) 0.97 [ 3.12](14.32)
2 1.00 [ -0.00](26.14) 1.23 [-22.58](13.48)
4 1.00 [ -0.00](13.22) 1.00 [ -0.00]( 6.04)
8 1.00 [ -0.00]( 6.23) 1.00 [ -0.00](13.09)
16 1.00 [ -0.00]( 3.49) 1.02 [ -1.69]( 3.43)
32 1.00 [ -0.00]( 2.20) 0.98 [ 2.13]( 2.47)
64 1.00 [ -0.00]( 7.17) 0.88 [ 12.50]( 3.18)
128 1.00 [ -0.00]( 2.79) 1.02 [ -2.46]( 8.29)
256 1.00 [ -0.00](13.02) 1.01 [ -1.34](37.58)
512 1.00 [ -0.00]( 4.27) 0.79 [ 21.49]( 2.41)


==================================================================
Test : DeathStarBench
Units : Normalized throughput
Interpretation: Higher is better
Statistic : Mean
==================================================================
Pinning scaling tip eie (pct imp)
1CCD 1 1.00 1.15 (%diff: 15.68%)
2CCD 2 1.00 0.99 (%diff: -1.12%)
4CCD 4 1.00 1.11 (%diff: 11.65%)
8CCD 8 1.00 1.05 (%diff: 4.98%)

--

>
> [..snip..]
>

I'll try to get data from more workloads and will update the thread
when it arrives.

--
Thanks and Regards,
Prateek

2024-04-24 19:02:37

by Peter Zijlstra

Subject: Re: [RFC PATCH 1/1] sched/eevdf: Skip eligibility check for current entity during wakeup preemption

On Mon, Mar 25, 2024 at 11:32:26AM +0530, K Prateek Nayak wrote:
> With the curr entity's eligibility check, a wakeup preemption is very
> likely when an entity with positive lag joins the runqueue, pushing the
> avg_vruntime of the runqueue backwards and making the vruntime of the
> current entity ineligible. This leads to aggressive wakeup preemption
> which was previously guarded against by wakeup_granularity_ns in legacy
> CFS. The figure below depicts one such aggressive preemption scenario
> with EEVDF in DeathStarBench [1]:
>
> deadline for Nginx
> |
> +-------+ | |
> /-- | Nginx | -|------------------> |
> | +-------+ | |
> | |
> | -----------|-------------------------------> vruntime timeline
> | \--> rq->avg_vruntime
> |
> | wakes service on the same runqueue since system is busy
> |
> | +---------+|
> \-->| Service || (service has +ve lag pushes avg_vruntime backwards)
> +---------+|
> | |
> wakeup | +--|-----+ |
> preempts \---->| N|ginx | --------------------> | {deadline for Nginx}
> +--|-----+ |
> (Nginx ineligible)
> -----------|-------------------------------> vruntime timeline
> \--> rq->avg_vruntime

This graph is really hard to interpret. If you want to illustrate
avg_vruntime moves back, you should not align it. That's really
disorienting.

In both (upper and lower) nginx has the same vruntime thus *that* should
be aligned. The lower will have service placed left and with that
avg_vruntime also moves left, rendering nginx in-eligible.


> Signed-off-by: K Prateek Nayak <[email protected]>
> ---
> kernel/sched/fair.c | 24 ++++++++++++++++++++----
> kernel/sched/features.h | 1 +
> 2 files changed, 21 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 6a16129f9a5c..a9b145a4eab0 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -875,7 +875,7 @@ struct sched_entity *__pick_first_entity(struct cfs_rq *cfs_rq)
> *
> * Which allows tree pruning through eligibility.
> */
> -static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
> +static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq, bool wakeup_preempt)
> {
> struct rb_node *node = cfs_rq->tasks_timeline.rb_root.rb_node;
> struct sched_entity *se = __pick_first_entity(cfs_rq);
> @@ -889,7 +889,23 @@ static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
> if (cfs_rq->nr_running == 1)
> return curr && curr->on_rq ? curr : se;
>
> - if (curr && (!curr->on_rq || !entity_eligible(cfs_rq, curr)))
> + if (curr && !curr->on_rq)
> + curr = NULL;
> +
> + /*
> + * When an entity with positive lag wakes up, it pushes the
> + * avg_vruntime of the runqueue backwards. This may cause the
> + * current entity to become ineligible soon into its run, leading
> + * to wakeup preemption.
> + *
> + * To prevent such aggressive preemption of the current running
> + * entity during task wakeups, skip the eligibility check if the
> + * slice promised to the entity at its selection has not yet
> + * elapsed.
> + */
> + if (curr &&
> + !(sched_feat(RUN_TO_PARITY_WAKEUP) && wakeup_preempt && curr->vlag == curr->deadline) &&
> + !entity_eligible(cfs_rq, curr))
> curr = NULL;
>
> /*

So I see what you want to do, but this is highly unreadable.

I'll try something like the below on top of queue/sched/eevdf, but I
should probably first look at fixing those reported crashes on that tree
:/

---
kernel/sched/fair.c | 60 ++++++++++++++++++++++++++++++++++---------------
kernel/sched/features.h | 11 +++++----
2 files changed, 49 insertions(+), 22 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8a9206d8532f..23977ed1cb2c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -855,6 +855,39 @@ struct sched_entity *__pick_first_entity(struct cfs_rq *cfs_rq)
return __node_2_se(left);
}

+static inline bool pick_curr(struct cfs_rq *cfs_rq,
+ struct sched_entity *curr, struct sched_entity *wakee)
+{
+ /*
+ * Nothing to preserve...
+ */
+ if (!curr || !sched_feat(RESPECT_SLICE))
+ return false;
+
+ /*
+ * Allow preemption at the 0-lag point -- even if not all of the slice
+ * is consumed. Note: placement of positive lag can push V left and render
+ * @curr instantly ineligible irrespective of the time on-cpu.
+ */
+ if (sched_feat(RUN_TO_PARITY) && !entity_eligible(cfs_rq, curr))
+ return false;
+
+ /*
+ * Don't preserve @curr when the @wakee has a shorter slice and earlier
+ * deadline. IOW, explicitly allow preemption.
+ */
+ if (sched_feat(PREEMPT_SHORT) && wakee &&
+ wakee->slice < curr->slice &&
+ (s64)(wakee->deadline - curr->deadline) < 0)
+ return false;
+
+ /*
+ * Preserve @curr to allow it to finish its first slice.
+ * See the HACK in set_next_entity().
+ */
+ return curr->vlag == curr->deadline;
+}
+
/*
* Earliest Eligible Virtual Deadline First
*
@@ -874,28 +907,27 @@ struct sched_entity *__pick_first_entity(struct cfs_rq *cfs_rq)
*
* Which allows tree pruning through eligibility.
*/
-static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
+static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq, struct sched_entity *wakee)
{
struct rb_node *node = cfs_rq->tasks_timeline.rb_root.rb_node;
struct sched_entity *se = __pick_first_entity(cfs_rq);
struct sched_entity *curr = cfs_rq->curr;
struct sched_entity *best = NULL;

+ if (curr && !curr->on_rq)
+ curr = NULL;
+
/*
* We can safely skip eligibility check if there is only one entity
* in this cfs_rq, saving some cycles.
*/
if (cfs_rq->nr_running == 1)
- return curr && curr->on_rq ? curr : se;
-
- if (curr && (!curr->on_rq || !entity_eligible(cfs_rq, curr)))
- curr = NULL;
+ return curr ?: se;

/*
- * Once selected, run a task until it either becomes non-eligible or
- * until it gets a new slice. See the HACK in set_next_entity().
+ * Preserve @curr to let it finish its slice.
*/
- if (sched_feat(RUN_TO_PARITY) && curr && curr->vlag == curr->deadline)
+ if (pick_curr(cfs_rq, curr, wakee))
return curr;

/* Pick the leftmost entity if it's eligible */
@@ -5507,7 +5539,7 @@ pick_next_entity(struct rq *rq, struct cfs_rq *cfs_rq)
return cfs_rq->next;
}

- struct sched_entity *se = pick_eevdf(cfs_rq);
+ struct sched_entity *se = pick_eevdf(cfs_rq, NULL);
if (se->sched_delayed) {
dequeue_entities(rq, se, DEQUEUE_SLEEP | DEQUEUE_DELAYED);
SCHED_WARN_ON(se->sched_delayed);
@@ -8548,15 +8580,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
cfs_rq = cfs_rq_of(se);
update_curr(cfs_rq);

- if (sched_feat(PREEMPT_SHORT) && pse->slice < se->slice &&
- entity_eligible(cfs_rq, pse) &&
- (s64)(pse->deadline - se->deadline) < 0 &&
- se->vlag == se->deadline) {
- /* negate RUN_TO_PARITY */
- se->vlag = se->deadline - 1;
- }
-
- if (pick_eevdf(cfs_rq) == pse)
+ if (pick_eevdf(cfs_rq, pse) == pse)
goto preempt;

return;
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 64ce99cf04ec..2285dc30294c 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -10,12 +10,15 @@ SCHED_FEAT(PLACE_LAG, true)
*/
SCHED_FEAT(PLACE_DEADLINE_INITIAL, true)
/*
- * Inhibit (wakeup) preemption until the current task has either matched the
- * 0-lag point or until is has exhausted it's slice.
+ * Inhibit (wakeup) preemption until the current task has exhausted its slice.
*/
-SCHED_FEAT(RUN_TO_PARITY, true)
+SCHED_FEAT(RESPECT_SLICE, true)
/*
- * Allow tasks with a shorter slice to disregard RUN_TO_PARITY
+ * Relax RESPECT_SLICE to allow preemption once current has reached 0-lag.
+ */
+SCHED_FEAT(RUN_TO_PARITY, false)
+/*
+ * Allow tasks with a shorter slice to disregard RESPECT_SLICE
*/
SCHED_FEAT(PREEMPT_SHORT, true)


2024-04-25 03:34:43

by K Prateek Nayak

Subject: Re: [RFC PATCH 1/1] sched/eevdf: Skip eligibility check for current entity during wakeup preemption

Hello Peter,

On 4/24/2024 8:37 PM, Peter Zijlstra wrote:
> On Mon, Mar 25, 2024 at 11:32:26AM +0530, K Prateek Nayak wrote:
>> With the curr entity's eligibility check, a wakeup preemption is very
>> likely when an entity with positive lag joins the runqueue, pushing the
>> avg_vruntime of the runqueue backwards and making the vruntime of the
>> current entity ineligible. This leads to aggressive wakeup preemption
>> which was previously guarded against by wakeup_granularity_ns in legacy
>> CFS. The figure below depicts one such aggressive preemption scenario
>> with EEVDF in DeathStarBench [1]:
>>
>> deadline for Nginx
>> |
>> +-------+ | |
>> /-- | Nginx | -|------------------> |
>> | +-------+ | |
>> | |
>> | -----------|-------------------------------> vruntime timeline
>> | \--> rq->avg_vruntime
>> |
>> | wakes service on the same runqueue since system is busy
>> |
>> | +---------+|
>> \-->| Service || (service has +ve lag pushes avg_vruntime backwards)
>> +---------+|
>> | |
>> wakeup | +--|-----+ |
>> preempts \---->| N|ginx | --------------------> | {deadline for Nginx}
>> +--|-----+ |
>> (Nginx ineligible)
>> -----------|-------------------------------> vruntime timeline
>> \--> rq->avg_vruntime
>
> This graph is really hard to interpret. If you want to illustrate
> avg_vruntime moves back, you should not align it. That's really
> disorienting.
>
> In both (upper and lower) nginx has the same vruntime thus *that* should
> be aligned. The lower will have service placed left and with that
> avg_vruntime also moves left, rendering nginx in-eligible.

Sorry about that. I'll keep this in mind for future illustrations.

>
>
>> Signed-off-by: K Prateek Nayak <[email protected]>
>> ---
>> kernel/sched/fair.c | 24 ++++++++++++++++++++----
>> kernel/sched/features.h | 1 +
>> 2 files changed, 21 insertions(+), 4 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 6a16129f9a5c..a9b145a4eab0 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -875,7 +875,7 @@ struct sched_entity *__pick_first_entity(struct cfs_rq *cfs_rq)
>> *
>> * Which allows tree pruning through eligibility.
>> */
>> -static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
>> +static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq, bool wakeup_preempt)
>> {
>> struct rb_node *node = cfs_rq->tasks_timeline.rb_root.rb_node;
>> struct sched_entity *se = __pick_first_entity(cfs_rq);
>> @@ -889,7 +889,23 @@ static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
>> if (cfs_rq->nr_running == 1)
>> return curr && curr->on_rq ? curr : se;
>>
>> - if (curr && (!curr->on_rq || !entity_eligible(cfs_rq, curr)))
>> + if (curr && !curr->on_rq)
>> + curr = NULL;
>> +
>> + /*
>> + * When an entity with positive lag wakes up, it pushes the
>> + * avg_vruntime of the runqueue backwards. This may cause the
>> + * current entity to become ineligible soon into its run, leading
>> + * to wakeup preemption.
>> + *
>> + * To prevent such aggressive preemption of the current running
>> + * entity during task wakeups, skip the eligibility check if the
>> + * slice promised to the entity at its selection has not yet
>> + * elapsed.
>> + */
>> + if (curr &&
>> + !(sched_feat(RUN_TO_PARITY_WAKEUP) && wakeup_preempt && curr->vlag == curr->deadline) &&
>> + !entity_eligible(cfs_rq, curr))
>> curr = NULL;
>>
>> /*
>
> So I see what you want to do, but this is highly unreadable.
>
> I'll try something like the below on top of queue/sched/eevdf, but I
> should probably first look at fixing those reported crashes on that tree
> :/

I'll give this a try since Mike's suggestion seems to have fixed the
crash I was observing :) Thank you for suggesting this alternative.
--
Thanks and Regards,
Prateek

>
> ---
> kernel/sched/fair.c | 60 ++++++++++++++++++++++++++++++++++---------------
> kernel/sched/features.h | 11 +++++----
> 2 files changed, 49 insertions(+), 22 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 8a9206d8532f..23977ed1cb2c 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -855,6 +855,39 @@ struct sched_entity *__pick_first_entity(struct cfs_rq *cfs_rq)
> return __node_2_se(left);
> }
>
> +static inline bool pick_curr(struct cfs_rq *cfs_rq,
> + struct sched_entity *curr, struct sched_entity *wakee)
> +{
> + /*
> + * Nothing to preserve...
> + */
> + if (!curr || !sched_feat(RESPECT_SLICE))
> + return false;
> +
> + /*
> + * Allow preemption at the 0-lag point -- even if not all of the slice
> + * is consumed. Note: placement of positive lag can push V left and render
> + * @curr instantly ineligible irrespective of the time on-cpu.
> + */
> + if (sched_feat(RUN_TO_PARITY) && !entity_eligible(cfs_rq, curr))
> + return false;
> +
> + /*
> + * Don't preserve @curr when the @wakee has a shorter slice and earlier
> + * deadline. IOW, explicitly allow preemption.
> + */
> + if (sched_feat(PREEMPT_SHORT) && wakee &&
> + wakee->slice < curr->slice &&
> + (s64)(wakee->deadline - curr->deadline) < 0)
> + return false;
> +
> + /*
> + * Preserve @curr to allow it to finish its first slice.
> + * See the HACK in set_next_entity().
> + */
> + return curr->vlag == curr->deadline;
> +}
> +
> /*
> * Earliest Eligible Virtual Deadline First
> *
> @@ -874,28 +907,27 @@ struct sched_entity *__pick_first_entity(struct cfs_rq *cfs_rq)
> *
> * Which allows tree pruning through eligibility.
> */
> -static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
> +static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq, struct sched_entity *wakee)
> {
> struct rb_node *node = cfs_rq->tasks_timeline.rb_root.rb_node;
> struct sched_entity *se = __pick_first_entity(cfs_rq);
> struct sched_entity *curr = cfs_rq->curr;
> struct sched_entity *best = NULL;
>
> + if (curr && !curr->on_rq)
> + curr = NULL;
> +
> /*
> * We can safely skip eligibility check if there is only one entity
> * in this cfs_rq, saving some cycles.
> */
> if (cfs_rq->nr_running == 1)
> - return curr && curr->on_rq ? curr : se;
> -
> - if (curr && (!curr->on_rq || !entity_eligible(cfs_rq, curr)))
> - curr = NULL;
> + return curr ?: se;
>
> /*
> - * Once selected, run a task until it either becomes non-eligible or
> - * until it gets a new slice. See the HACK in set_next_entity().
> + * Preserve @curr to let it finish its slice.
> */
> - if (sched_feat(RUN_TO_PARITY) && curr && curr->vlag == curr->deadline)
> + if (pick_curr(cfs_rq, curr, wakee))
> return curr;
>
> /* Pick the leftmost entity if it's eligible */
> @@ -5507,7 +5539,7 @@ pick_next_entity(struct rq *rq, struct cfs_rq *cfs_rq)
> return cfs_rq->next;
> }
>
> - struct sched_entity *se = pick_eevdf(cfs_rq);
> + struct sched_entity *se = pick_eevdf(cfs_rq, NULL);
> if (se->sched_delayed) {
> dequeue_entities(rq, se, DEQUEUE_SLEEP | DEQUEUE_DELAYED);
> SCHED_WARN_ON(se->sched_delayed);
> @@ -8548,15 +8580,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
> cfs_rq = cfs_rq_of(se);
> update_curr(cfs_rq);
>
> - if (sched_feat(PREEMPT_SHORT) && pse->slice < se->slice &&
> - entity_eligible(cfs_rq, pse) &&
> - (s64)(pse->deadline - se->deadline) < 0 &&
> - se->vlag == se->deadline) {
> - /* negate RUN_TO_PARITY */
> - se->vlag = se->deadline - 1;
> - }
> -
> - if (pick_eevdf(cfs_rq) == pse)
> + if (pick_eevdf(cfs_rq, pse) == pse)
> goto preempt;
>
> return;
> diff --git a/kernel/sched/features.h b/kernel/sched/features.h
> index 64ce99cf04ec..2285dc30294c 100644
> --- a/kernel/sched/features.h
> +++ b/kernel/sched/features.h
> @@ -10,12 +10,15 @@ SCHED_FEAT(PLACE_LAG, true)
> */
> SCHED_FEAT(PLACE_DEADLINE_INITIAL, true)
> /*
> - * Inhibit (wakeup) preemption until the current task has either matched the
> - * 0-lag point or until is has exhausted it's slice.
> + * Inhibit (wakeup) preemption until the current task has exhausted its slice.
> */
> -SCHED_FEAT(RUN_TO_PARITY, true)
> +SCHED_FEAT(RESPECT_SLICE, true)
> /*
> - * Allow tasks with a shorter slice to disregard RUN_TO_PARITY
> + * Relax RESPECT_SLICE to allow preemption once current has reached 0-lag.
> + */
> +SCHED_FEAT(RUN_TO_PARITY, false)
> +/*
> + * Allow tasks with a shorter slice to disregard RESPECT_SLICE
> */
> SCHED_FEAT(PREEMPT_SHORT, true)
>