2023-09-19 13:23:11

by Ingo Molnar

Subject: [PATCH] sched/fair: Do not wakeup-preempt same-prio SCHED_OTHER tasks


* Mike Galbraith <[email protected]> wrote:

> On Tue, 2023-08-22 at 08:33 +0530, K Prateek Nayak wrote:
> > Hello Mike,
>
> Greetings!
>
> > > FWIW, there are more tbench shards lying behind EEVDF than in front.
> > >
> > > tbench 8 on old i7-4790 box
> > > 4.4.302      4024
> > > 6.4.11       3668
> > > 6.4.11-eevdf 3522
> > >
> >
> > I agree, but on servers, tbench has been useful to identify a variety of
> > issues [1][2][3] and I believe it is better to pick some shards up than
> > leave them lying around for others to step on :)
>
> Absolutely, but in this case it isn't due to various overheads wiggling
> about and/or bitrot; everything's identical except the scheduler, and
> its overhead is essentially identical too.
>
> taskset -c 3 pipe-test
> 6.4.11 1.420033 usecs/loop -- avg 1.420033 1408.4 KHz
> 6.4.11-eevdf 1.413024 usecs/loop -- avg 1.413024 1415.4 KHz
>
> Methinks these shards are due to tbench simply being one of those
> things that happens to like the CFS notion of short-term fairness a bit
> better than the EEVDF notion, i.e. they are inevitable fallout tied to
> the very thing that makes EEVDF service less spiky than CFS, and thus
> will be difficult to sweep up.
>
> Too bad I didn't save Peter's test hack to make EEVDF use the same
> notion of fair (not a keeper) as I think that would likely prove it.

BTW, if overscheduling is still an issue, I'm wondering whether we
could go so far as to turn off wakeup preemption for same-prio
SCHED_OTHER tasks altogether, as per the attached patch?

What does this do to your various tests? (Boot-tested only.)

Thanks,

Ingo

=============>
From: Ingo Molnar <[email protected]>
Date: Tue, 19 Sep 2023 10:49:51 +0200
Subject: [PATCH] sched/fair: Do not wakeup-preempt same-prio SCHED_OTHER tasks

Reduce overscheduling some more: do not wakeup-preempt same-priority
SCHED_OTHER tasks.

Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/fair.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a502e3255392..98efe01c8e4e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8042,7 +8042,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
* Batch and idle tasks do not preempt non-idle tasks (their preemption
* is driven by the tick):
*/
- if (unlikely(p->policy != SCHED_NORMAL) || !sched_feat(WAKEUP_PREEMPTION))
+ if (unlikely(p->policy != SCHED_NORMAL) || likely(p->prio == curr->prio) || !sched_feat(WAKEUP_PREEMPTION))
return;

find_matching_se(&se, &pse);


2023-09-25 15:11:05

by Ingo Molnar

Subject: Re: [PATCH] sched/fair: Do not wakeup-preempt same-prio SCHED_OTHER tasks


* kernel test robot <[email protected]> wrote:

> Hello,
>
> kernel test robot noticed a -19.0% regression of stress-ng.filename.ops_per_sec on:

Thanks for the testing, this is useful!

So I've tabulated the results into a much easier-to-read format:

> | testcase: change | stress-ng: stress-ng.filename.ops_per_sec -19.0% regression
> | testcase: change | stress-ng: stress-ng.lockbus.ops_per_sec -6.0% regression
> | testcase: change | stress-ng: stress-ng.sigfd.ops_per_sec 17.6% improvement
> | testcase: change | phoronix-test-suite: phoronix-test-suite.darktable.Masskrug.CPU-only.seconds -5.3% improvement
> | testcase: change | lmbench3: lmbench3.TCP.socket.bandwidth.64B.MB/sec 11.5% improvement
> | testcase: change | phoronix-test-suite: phoronix-test-suite.darktable.Boat.CPU-only.seconds -3.5% improvement
> | testcase: change | stress-ng: stress-ng.sigrt.ops_per_sec 100.2% improvement
> | testcase: change | stress-ng: stress-ng.sigsuspend.ops_per_sec -93.9% regression
> | testcase: change | stress-ng: stress-ng.sigsuspend.ops_per_sec -82.1% regression
> | testcase: change | stress-ng: stress-ng.sock.ops_per_sec 59.4% improvement
> | testcase: change | blogbench: blogbench.write_score -35.9% regression
> | testcase: change | hackbench: hackbench.throughput -4.8% regression
> | testcase: change | blogbench: blogbench.write_score -59.3% regression
> | testcase: change | stress-ng: stress-ng.exec.ops_per_sec -34.6% regression
> | testcase: change | netperf: netperf.Throughput_Mbps 60.6% improvement
> | testcase: change | hackbench: hackbench.throughput 19.1% improvement
> | testcase: change | stress-ng: stress-ng.dnotify.ops_per_sec -15.7% regression

And then sorted them along the regression/improvement axis:

> | testcase: change | stress-ng: stress-ng.sigsuspend.ops_per_sec -93.9% regression
> | testcase: change | stress-ng: stress-ng.sigsuspend.ops_per_sec -82.1% regression
> | testcase: change | blogbench: blogbench.write_score -59.3% regression
> | testcase: change | blogbench: blogbench.write_score -35.9% regression
> | testcase: change | stress-ng: stress-ng.exec.ops_per_sec -34.6% regression
> | testcase: change | stress-ng: stress-ng.filename.ops_per_sec -19.0% regression
> | testcase: change | stress-ng: stress-ng.dnotify.ops_per_sec -15.7% regression
> | testcase: change | stress-ng: stress-ng.lockbus.ops_per_sec -6.0% regression
> | testcase: change | hackbench: hackbench.throughput -4.8% regression
> | testcase: change | phoronix-test-suite: phoronix-test-suite.darktable.Masskrug.CPU-only.seconds +5.3% improvement
> | testcase: change | phoronix-test-suite: phoronix-test-suite.darktable.Boat.CPU-only.seconds +3.5% improvement
> | testcase: change | lmbench3: lmbench3.TCP.socket.bandwidth.64B.MB/sec 11.5% improvement
> | testcase: change | stress-ng: stress-ng.sigfd.ops_per_sec 17.6% improvement
> | testcase: change | hackbench: hackbench.throughput 19.1% improvement
> | testcase: change | stress-ng: stress-ng.sock.ops_per_sec 59.4% improvement
> | testcase: change | netperf: netperf.Throughput_Mbps 60.6% improvement
> | testcase: change | stress-ng: stress-ng.sigrt.ops_per_sec 100.2% improvement

Notes on the testing results:

- the '+' denotes an inverted improvement. The mixing of signs in the output of the
ktest robot is arguably confusing.

- Any hope of getting a similar summary format by default? It's much more
informative than just picking out the biggest regression, which wasn't even
done correctly AFAICT.

Summary:

While there are a lot of improvements, it is primarily the nature of the
performance regressions that dictates the way forward:

- stress-ng.sigsuspend.ops_per_sec regressions, -93%:

Clearly signal delivery performance hurts from delayed preemption, but
that should be straightforward to resolve, if we are willing to commit
to adding a high-prio insta-wakeup variant API ...

- stress-ng.exec.ops_per_sec -34% regression:

Likewise, this possibly expresses that it's better to immediately reschedule
during exec() - but maybe there's more to it, and it reflects some unfavorable
migration, as suggested by the NUMA locality figures:

             %change       %stddev
                |              \
   79317172   -34.2%   52217838 ± 3%  numa-numastat.node0.local_node
   79360983   -34.2%   52240348 ± 3%  numa-numastat.node0.numa_hit
   77971050   -33.2%   52068168 ± 3%  numa-numastat.node1.local_node
   78009071   -33.2%   52089987 ± 3%  numa-numastat.node1.numa_hit
      88287   -45.7%      47970 ± 2%  vmstat.system.cs

- 'blogbench' regression of -59%:

It too has a very large reduction in context switches:

            %stddev     %change       %stddev
                \          |              \
      30035           -49.7%      15097 ± 3%  vmstat.system.cs
    2243545 ± 2%       -4.1%    2152228       blogbench.read_score
   52412617           -28.3%   37571769       blogbench.time.file_system_outputs
    2682930           -74.1%     694136       blogbench.time.involuntary_context_switches
    2369329           -50.0%    1184098 ± 5%  blogbench.time.voluntary_context_switches
       5851           -35.9%       3752 ± 2%  blogbench.write_score

It's unclear to me what's happening with this one, just from these stats,
but it's "write_score" that hurts most.

- 'stress-ng.filename.ops_per_sec' regression of -19%:

This test suffered from an *increase* in context-switching, and a large
increase in CPU-idle:

            %stddev     %change       %stddev
                \          |              \
    4641666           +19.5%    5545394 ± 2%  cpuidle..usage
      90589 ± 2%      +70.5%     154471 ± 2%  vmstat.system.cs
     628439           -19.2%     507711       stress-ng.filename.ops
      10317           -19.0%       8355       stress-ng.filename.ops_per_sec

     171981           -59.7%      69333 ± 3%  stress-ng.time.involuntary_context_switches
     770691 ± 3%     +200.9%    2319214       stress-ng.time.voluntary_context_switches

Anyway, it's clear from these results that while many workloads are hurt
by our notion of wakeup preemption, several others benefit from it -
especially generic ones like the phoronix-test-suite, which have no good
way to turn off wakeup preemption (SCHED_BATCH might help though).

One way to approach this would be, instead of always doing
wakeup preemption (our current default), to turn it around and
only use it when it is clearly beneficial - such as for signal
delivery, or exec().

The canonical way to solve this would be to give *userspace* a way
to signal that it's beneficial to preempt immediately, i.e. yield(),
but right now that interface hurts tasks that only want to
give other tasks a chance to run, without necessarily giving up
their own right to run:

se->deadline += calc_delta_fair(se->slice, se);

Anyway, my patch is obviously a no-go as-is, and this clearly needs more work.

Thanks,

Ingo

2023-09-25 19:21:16

by Chen Yu

Subject: Re: [PATCH] sched/fair: Do not wakeup-preempt same-prio SCHED_OTHER tasks

Hi Ingo,

On 2023-09-25 at 13:07:08 +0200, Ingo Molnar wrote:
>
> Anyway, it's clear from these results that while many workloads hurt
> from our notion of wake-preemption, there's several ones that benefit
> from it, especially generic ones like phoronix-test-suite - which have
> no good way to turn off wakeup preemption (SCHED_BATCH might help though).
>
> One way to approach this would be to instead of always doing
> wakeup-preemption (our current default), we could turn it around and
> only use it when it is clearly beneficial - such as signal delivery,
> or exec().
>
> The canonical way to solve this would be give *userspace* a way to
> signal that it's beneficial to preempt immediately, ie. yield(),
> but right now that interface is hurting tasks that only want to
> give other tasks a chance to run, without necessarily giving up
> their own right to run:
>
> se->deadline += calc_delta_fair(se->slice, se);
>

Do you mean we want to give a hint to the scheduler that, in most cases,
tasks do not want to be wakeup-preempted, but some other tasks want to
keep the current wakeup-preemption strategy?

If we want the current task to be preempted more easily, we have to shrink
the current task's slice, which gives up some of its own right to run. If
this is the case, how about introducing a negative se->slice: a task whose
se->slice is negative indicates that it wants to be wakeup-preempted,
and its real slice is the absolute value of se->slice.

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f7c0d9cad6e0..019576da9737 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8120,7 +8120,7 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
* Batch and idle tasks do not preempt non-idle tasks (their preemption
* is driven by the tick):
*/
- if (unlikely(p->policy != SCHED_NORMAL) || !sched_feat(WAKEUP_PREEMPTION))
+ if (unlikely(p->policy != SCHED_NORMAL) || || se->slice > 0 || !sched_feat(WAKEUP_PREEMPTION))
return;

find_matching_se(&se, &pse);

thanks,
Chenyu