2023-08-10 15:09:04

by Oliver Sang

Subject: [tip:sched/eevdf] [sched/fair] e0c2ff903c: phoronix-test-suite.blogbench.Write.final_score -34.8% regression



Hello,

kernel test robot noticed a -34.8% regression of phoronix-test-suite.blogbench.Write.final_score on:


commit: e0c2ff903c320d3fd3c2c604dc401b3b7c0a1d13 ("sched/fair: Remove sched_feat(START_DEBIT)")
https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git sched/eevdf

testcase: phoronix-test-suite
test machine: 96 threads 2 sockets Intel(R) Xeon(R) Gold 6252 CPU @ 2.10GHz (Cascade Lake) with 512G memory
parameters:

test: blogbench-1.1.0
option_a: Write
cpufreq_governor: performance


(
previously, we reported
"[tip:sched/eevdf] [sched/fair] e0c2ff903c: pft.faults_per_sec_per_cpu 7.0% improvement"
on
https://lore.kernel.org/all/[email protected]/
since we have now also found a regression, we are reporting again FYI
)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add the following tags
| Reported-by: kernel test robot <[email protected]>
| Closes: https://lore.kernel.org/oe-lkp/[email protected]


Details are as below:
-------------------------------------------------------------------------------------------------->


The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20230810/[email protected]

=========================================================================================
compiler/cpufreq_governor/kconfig/option_a/rootfs/tbox_group/test/testcase:
gcc-12/performance/x86_64-rhel-8.3/Write/debian-x86_64-phoronix/lkp-csl-2sp7/blogbench-1.1.0/phoronix-test-suite

commit:
af4cf40470 ("sched/fair: Add cfs_rq::avg_vruntime")
e0c2ff903c ("sched/fair: Remove sched_feat(START_DEBIT)")

af4cf40470c22efa e0c2ff903c320d3fd3c2c604dc4
---------------- ---------------------------
          %stddev   %change            %stddev
             \        |                   \
      13.43 ±  6%      +1.7        15.15 ±  5%  mpstat.cpu.all.idle%
  4.516e+09 ±  7%    +13.6%    5.129e+09 ±  6%  cpuidle..time
    5386162 ±  6%    +15.1%      6199901 ±  5%  cpuidle..usage
    5829408 ±  7%     -8.7%      5320418 ±  4%  numa-numastat.node0.local_node
    5839930 ±  7%     -8.7%      5333078 ±  4%  numa-numastat.node0.numa_hit
    6025065 ±  5%    -12.9%      5245696 ±  7%  numa-numastat.node1.local_node
      57086          -28.2%        40989        vmstat.io.bo
   11120354          -21.6%      8721798        vmstat.memory.cache
      18750          +12.1%        21014        vmstat.system.cs
     215070 ± 25%    +29.4%       278379 ± 23%  numa-meminfo.node1.AnonPages.max
    5507703 ± 13%    -31.6%      3766183 ± 23%  numa-meminfo.node1.FilePages
    2118171 ± 13%    -26.7%      1551936 ± 25%  numa-meminfo.node1.Inactive
    6824965 ± 12%    -26.7%      5005740 ± 18%  numa-meminfo.node1.MemUsed
       4960          -34.8%         3235 ±  2%  phoronix-test-suite.blogbench.Write.final_score
   35120853          -29.4%     24804986        phoronix-test-suite.time.file_system_outputs
       8058           -1.9%         7908        phoronix-test-suite.time.percent_of_cpu_this_job_got
      26517           -1.2%        26196        phoronix-test-suite.time.system_time
    1445969          +10.6%      1599011        phoronix-test-suite.time.voluntary_context_switches
     600079 ±  3%    +26.4%       758665 ±  3%  turbostat.C1E
       0.75 ±  3%      +0.1         0.84 ±  4%  turbostat.C1E%
    4372028 ±  7%    +14.1%      4987180 ±  7%  turbostat.C6
      13.09 ±  6%      +1.6        14.72 ±  6%  turbostat.C6%
      11.39 ±  5%    +10.4%        12.57 ±  4%  turbostat.CPU%c1
    2330913 ± 13%    -23.5%      1782895 ± 14%  numa-vmstat.node0.nr_dirtied
    2256547 ± 13%    -23.8%      1719197 ± 14%  numa-vmstat.node0.nr_written
    5840154 ±  7%     -8.7%      5333324 ±  4%  numa-vmstat.node0.numa_hit
    5829632 ±  7%     -8.7%      5320664 ±  4%  numa-vmstat.node0.numa_local
    2287685 ± 11%    -33.8%      1514622 ± 20%  numa-vmstat.node1.nr_dirtied
    1376658 ± 13%    -31.6%       941500 ± 23%  numa-vmstat.node1.nr_file_pages
     127.17 ± 18%    -52.4%        60.50 ± 31%  numa-vmstat.node1.nr_writeback
    2215841 ± 12%    -34.1%      1460272 ± 20%  numa-vmstat.node1.nr_written
    6025232 ±  5%    -12.9%      5245703 ±  7%  numa-vmstat.node1.numa_local
      74.24 ±  8%    -38.1%        45.97 ± 11%  sched_debug.cfs_rq:/.load_avg.avg
       3167 ± 14%    -78.6%       676.56 ± 45%  sched_debug.cfs_rq:/.load_avg.max
     339.56 ± 15%    -74.2%        87.77 ± 38%  sched_debug.cfs_rq:/.load_avg.stddev
       1829 ±  6%    -12.7%         1597 ±  9%  sched_debug.cfs_rq:/.runnable_avg.max
       1490 ±  7%    -13.5%         1289 ±  7%  sched_debug.cfs_rq:/.util_avg.max
       1495 ±  7%    -13.3%         1296 ±  8%  sched_debug.cfs_rq:/.util_est_enqueued.max
      33772          +15.4%        38971        sched_debug.cpu.nr_switches.avg
     117253 ± 10%   +106.6%       242187 ± 13%  sched_debug.cpu.nr_switches.max
      11889 ±  9%    +96.8%        23399 ± 12%  sched_debug.cpu.nr_switches.stddev
   58611259          -50.0%     29305629        sched_debug.sysctl_sched.sysctl_sched_features
    6885022 ±  2%    -25.3%      5139844 ±  2%  meminfo.Active
    6735239 ±  2%    -25.9%      4988483 ±  2%  meminfo.Active(file)
   10388292          -21.3%      8173968        meminfo.Cached
    3953695          -11.9%      3483739        meminfo.Inactive
    2793179          -16.8%      2323187        meminfo.Inactive(file)
     716630          -21.7%       561030        meminfo.KReclaimable
   12938216          -18.5%     10538792        meminfo.Memused
     716630          -21.7%       561030        meminfo.SReclaimable
     462556          -10.4%       414369        meminfo.SUnreclaim
    1179187          -17.3%       975399        meminfo.Slab
   13543613          -19.3%     10926031        meminfo.max_used_kB
    1682930 ±  2%    -25.9%      1246741 ±  2%  proc-vmstat.nr_active_file
    4618598          -28.6%      3297498 ±  2%  proc-vmstat.nr_dirtied
     115523 ±  2%     -4.3%       110605        proc-vmstat.nr_dirty
    2596313          -21.3%      2043401        proc-vmstat.nr_file_pages
     698013          -16.8%       580677        proc-vmstat.nr_inactive_file
     179075          -21.7%       140220        proc-vmstat.nr_slab_reclaimable
     115576          -10.4%       103600        proc-vmstat.nr_slab_unreclaimable
     200.00 ± 12%    -60.2%        79.67 ± 27%  proc-vmstat.nr_writeback
    4472388          -28.9%      3179453 ±  2%  proc-vmstat.nr_written
    1682930 ±  2%    -25.9%      1246741 ±  2%  proc-vmstat.nr_zone_active_file
     698013          -16.8%       580677        proc-vmstat.nr_zone_inactive_file
     115652 ±  2%     -4.4%       110571        proc-vmstat.nr_zone_write_pending
   11888053          -10.8%     10598323 ±  2%  proc-vmstat.numa_hit
   11861225          -10.9%     10573027 ±  2%  proc-vmstat.numa_local
    3434596          -28.4%      2460772        proc-vmstat.pgactivate
   18752650           -8.2%     17219808 ±  3%  proc-vmstat.pgalloc_normal
   18252683           -8.4%     16727157 ±  3%  proc-vmstat.pgfree
   19379562          -27.5%     14053587 ±  2%  proc-vmstat.pgpgout
     205988           +3.5%       213118        proc-vmstat.pgreuse
      97.12            -1.1        96.06        perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe
      97.08            -1.1        96.02        perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe
      52.33            -1.0        51.36        perf-profile.calltrace.cycles-pp.__x64_sys_openat.do_syscall_64.entry_SYSCALL_64_after_hwframe
      52.32            -1.0        51.35        perf-profile.calltrace.cycles-pp.do_sys_openat2.__x64_sys_openat.do_syscall_64.entry_SYSCALL_64_after_hwframe
      97.12            -1.1        96.06        perf-profile.children.cycles-pp.entry_SYSCALL_64_after_hwframe
      97.08            -1.1        96.02        perf-profile.children.cycles-pp.do_syscall_64
      85.21            -1.0        84.18        perf-profile.children.cycles-pp.native_queued_spin_lock_slowpath
      85.58            -1.0        84.56        perf-profile.children.cycles-pp._raw_spin_lock
      52.33            -1.0        51.36        perf-profile.children.cycles-pp.__x64_sys_openat
      52.32            -1.0        51.35        perf-profile.children.cycles-pp.do_sys_openat2
       0.64 ±  2%      +0.0         0.68 ±  2%  perf-profile.children.cycles-pp.asm_sysvec_apic_timer_interrupt
       0.28 ±  5%      +0.0         0.32 ±  5%  perf-profile.children.cycles-pp.__irq_exit_rcu
       0.02 ±141%      +0.1         0.07 ± 15%  perf-profile.children.cycles-pp.__x64_sys_rename
       0.08 ± 29%      +0.1         0.14 ± 19%  perf-profile.children.cycles-pp.process_one_work
       0.01 ±223%      +0.1         0.07 ± 15%  perf-profile.children.cycles-pp.do_renameat2
       0.08 ± 29%      +0.1         0.15 ± 20%  perf-profile.children.cycles-pp.worker_thread
       0.03 ±101%      +0.1         0.10 ± 21%  perf-profile.children.cycles-pp.__extent_writepage
       0.03 ±103%      +0.1         0.11 ± 20%  perf-profile.children.cycles-pp.do_writepages
       0.03 ±103%      +0.1         0.11 ± 20%  perf-profile.children.cycles-pp.extent_writepages
       0.03 ±103%      +0.1         0.11 ± 20%  perf-profile.children.cycles-pp.extent_write_cache_pages
      84.73            -1.0        83.69        perf-profile.self.cycles-pp.native_queued_spin_lock_slowpath
  5.487e+09           -2.2%    5.364e+09        perf-stat.i.branch-instructions
       0.73 ±  8%      +0.2         0.92 ± 12%  perf-stat.i.branch-miss-rate%
      18867          +12.0%        21136        perf-stat.i.context-switches
  2.307e+11           -2.0%    2.261e+11        perf-stat.i.cpu-cycles
       1352 ±  2%     +3.5%         1399        perf-stat.i.cpu-migrations
    4638095 ± 27%    -38.2%      2865514 ± 23%  perf-stat.i.dTLB-load-misses
  6.107e+09           -2.3%    5.965e+09        perf-stat.i.dTLB-loads
  2.391e+10           -2.3%    2.335e+10        perf-stat.i.instructions
       0.17           -7.8%         0.16 ±  2%  perf-stat.i.ipc
       0.62 ±  2%    +14.5%         0.71 ±  6%  perf-stat.i.major-faults
       2.40           -2.0%         2.35        perf-stat.i.metric.GHz
     139.98           -2.3%       136.76        perf-stat.i.metric.M/sec
      66.90            -1.8        65.11        perf-stat.i.node-load-miss-rate%
   65029001           -5.5%     61464545 ±  3%  perf-stat.i.node-load-misses
       0.08 ± 27%      -0.0         0.05 ± 24%  perf-stat.overall.dTLB-load-miss-rate%
      67.28            -2.3        64.93        perf-stat.overall.node-load-miss-rate%
  5.474e+09           -2.2%    5.352e+09        perf-stat.ps.branch-instructions
      18821          +12.1%        21105        perf-stat.ps.context-switches
  2.301e+11           -2.0%    2.256e+11        perf-stat.ps.cpu-cycles
       1349 ±  2%     +3.6%         1398        perf-stat.ps.cpu-migrations
    4630176 ± 27%    -38.2%      2859589 ± 23%  perf-stat.ps.dTLB-load-misses
  6.092e+09           -2.3%    5.952e+09        perf-stat.ps.dTLB-loads
  2.385e+10           -2.3%     2.33e+10        perf-stat.ps.instructions
       0.61 ±  2%    +14.6%         0.70 ±  6%  perf-stat.ps.major-faults
   64859061           -5.5%     61314167 ±  3%  perf-stat.ps.node-load-misses
  8.031e+12           -1.4%    7.921e+12        perf-stat.total.instructions



Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.


--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki



2023-08-11 01:50:14

by Chen Yu

Subject: Re: [tip:sched/eevdf] [sched/fair] e0c2ff903c: phoronix-test-suite.blogbench.Write.final_score -34.8% regression

On 2023-08-10 at 21:24:37 +0800, kernel test robot wrote:
>
>
> Hello,
>
> kernel test robot noticed a -34.8% regression of phoronix-test-suite.blogbench.Write.final_score on:
>
>
> commit: e0c2ff903c320d3fd3c2c604dc401b3b7c0a1d13 ("sched/fair: Remove sched_feat(START_DEBIT)")
> https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git sched/eevdf
>
> testcase: phoronix-test-suite
> test machine: 96 threads 2 sockets Intel(R) Xeon(R) Gold 6252 CPU @ 2.10GHz (Cascade Lake) with 512G memory
> parameters:
>
> test: blogbench-1.1.0
> option_a: Write
> cpufreq_governor: performance
>

It seems that commit e0c2ff903c32 removed sched_feat(START_DEBIT) for the initial
task, but it also increases the vruntime of non-initial tasks:
Before e0c2ff903c32, the vruntime of an enqueued task is:
cfs_rq->min_vruntime
After e0c2ff903c32, the vruntime of an enqueued task is:
avg_vruntime(cfs_rq) = \Sum v_i * w_i / W
= \Sum v_i / nr_tasks (when all weights are equal)
which is usually higher than cfs_rq->min_vruntime, so we give less sleep bonus to
the wakee, which could affect different workloads to varying degrees.
But since we later switched to lag-based placement, the lag is subtracted from
this new vruntime, which could mitigate the problem.
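
As a rough illustration of the difference between the two placement values
(a toy model, not kernel code: struct toy_entity and both helpers are
hypothetical names, and the real avg_vruntime() tracks sums relative to
min_vruntime to avoid overflow):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a sched_entity: only the fields used here. */
struct toy_entity {
	uint64_t vruntime;	/* v_i */
	uint64_t weight;	/* w_i */
};

/* Smallest vruntime on the queue (a simplification of min_vruntime). */
static uint64_t toy_min_vruntime(const struct toy_entity *se, size_t n)
{
	assert(n > 0);
	uint64_t min = se[0].vruntime;

	for (size_t i = 1; i < n; i++)
		if (se[i].vruntime < min)
			min = se[i].vruntime;
	return min;
}

/* Weighted average: \Sum v_i * w_i / W, with W = \Sum w_i. */
static uint64_t toy_avg_vruntime(const struct toy_entity *se, size_t n)
{
	uint64_t sum = 0, total_weight = 0;

	assert(n > 0);
	for (size_t i = 0; i < n; i++) {
		sum += se[i].vruntime * se[i].weight;
		total_weight += se[i].weight;
	}
	assert(total_weight > 0);
	return sum / total_weight;
}
```

For three equal-weight entities at vruntime 100, 200 and 600, the minimum is
100 while the average is 300, so a wakee placed at the average starts well
to the right of one placed at the minimum.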


thanks,
Chenyu


2023-08-11 03:48:48

by Chen Yu

Subject: Re: [tip:sched/eevdf] [sched/fair] e0c2ff903c: phoronix-test-suite.blogbench.Write.final_score -34.8% regression

Hi Peter,

On 2023-08-11 at 09:11:21 +0800, Chen Yu wrote:
> On 2023-08-10 at 21:24:37 +0800, kernel test robot wrote:
> >
> >
> > Hello,
> >
> > kernel test robot noticed a -34.8% regression of phoronix-test-suite.blogbench.Write.final_score on:
> >
> >
> > commit: e0c2ff903c320d3fd3c2c604dc401b3b7c0a1d13 ("sched/fair: Remove sched_feat(START_DEBIT)")
> > https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git sched/eevdf
> >
> > testcase: phoronix-test-suite
> > test machine: 96 threads 2 sockets Intel(R) Xeon(R) Gold 6252 CPU @ 2.10GHz (Cascade Lake) with 512G memory
> > parameters:
> >
> > test: blogbench-1.1.0
> > option_a: Write
> > cpufreq_governor: performance
> >
>
> It seems that commit e0c2ff903c32 removed sched_feat(START_DEBIT) for the initial
> task, but it also increases the vruntime of non-initial tasks:
> Before e0c2ff903c32, the vruntime of an enqueued task is:
> cfs_rq->min_vruntime
> After e0c2ff903c32, the vruntime of an enqueued task is:
> avg_vruntime(cfs_rq) = \Sum v_i * w_i / W
> = \Sum v_i / nr_tasks (when all weights are equal)
> which is usually higher than cfs_rq->min_vruntime, so we give less sleep bonus to
> the wakee, which could affect different workloads to varying degrees.
> But since we later switched to lag-based placement, the lag is subtracted from
> this new vruntime, which could mitigate the problem.
>
>

Since lkp previously reported a hackbench regression with the eevdf policy
enabled, I did some experiments and found that with eevdf enabled there are
more preemptions, and these preemptions could slow down the waker (each waker
wakes up 20 wakees in hackbench). The reason might be that
check_preempt_wakeup() preempts the current task more readily in eevdf:

baseline(no eevdf): sched/fair: Add cfs_rq::avg_vruntime
compare(eevdf): sched/smp: Use lag to simplify cross-runqueue placement

hackbench -g 112 -f 20 --threads -l 30000 -s 100
on a system with 1 socket 56C/112T online

1. Use the following bpf script to track the preemption count
within 10 seconds:

tracepoint:sched:sched_switch
{
	if (args->prev_state == TASK_RUNNING) {
		@preempt = count();
	}
}

baseline:
bpftrace preemption.bt
Attaching 4 probes...
10:54:45 Preemption count:
@preempt: 2409638


compare:
bpftrace preemption.bt
Attaching 4 probes...
10:02:21 Preemption count:
@preempt: 12147709

There are many more preemptions with eevdf enabled (about 5x: 12147709 vs. 2409638).


2. Add the following schedstats to track the ratio of successful preemption
in check_preempt_wakeup():

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 57e8bc14b06e..dfd4a6ebdf23 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8294,6 +8294,8 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
 	int next_buddy_marked = 0;
 	int cse_is_idle, pse_is_idle;

+	schedstat_inc(rq->check_preempt_count);
+
 	if (unlikely(se == pse))
 		return;

@@ -8358,8 +8360,12 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
 		/*
 		 * XXX pick_eevdf(cfs_rq) != se ?
 		 */
-		if (pick_eevdf(cfs_rq) == pse)
+		if (pick_eevdf(cfs_rq) == pse) {
+			if (se->vruntime <= pse->vruntime + sysctl_sched_wakeup_granularity)
+				schedstat_inc(rq->low_gran_preempt_count);
+
 			goto preempt;
+		}

 		return;
 	}
@@ -8377,6 +8383,8 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
 	return;

 preempt:
+	schedstat_inc(rq->need_preempt_count);
+
 	resched_curr(rq);
 	/*
 	 * Only set the backward buddy when the current task is still
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index aa5b293ca4ed..58abd3d53f1d 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1128,6 +1128,9 @@ struct rq {
 	/* try_to_wake_up() stats */
 	unsigned int ttwu_count;
 	unsigned int ttwu_local;
+	unsigned int check_preempt_count;
+	unsigned int need_preempt_count;
+	unsigned int low_gran_preempt_count;
 #endif

 #ifdef CONFIG_CPU_IDLE
diff --git a/kernel/sched/stats.c b/kernel/sched/stats.c
index 857f837f52cb..99392cad0c07 100644
--- a/kernel/sched/stats.c
+++ b/kernel/sched/stats.c
@@ -133,12 +133,14 @@ static int show_schedstat(struct seq_file *seq, void *v)

 		/* runqueue-specific stats */
 		seq_printf(seq,
-		    "cpu%d %u 0 %u %u %u %u %llu %llu %lu",
+		    "cpu%d %u 0 %u %u %u %u %llu %llu %lu %u %u %u",
 		    cpu, rq->yld_count,
 		    rq->sched_count, rq->sched_goidle,
 		    rq->ttwu_count, rq->ttwu_local,
 		    rq->rq_cpu_time,
-		    rq->rq_sched_info.run_delay, rq->rq_sched_info.pcount);
+		    rq->rq_sched_info.run_delay, rq->rq_sched_info.pcount,
+		    rq->check_preempt_count, rq->need_preempt_count,
+		    rq->low_gran_preempt_count);

 		seq_printf(seq, "\n");

--
2.25.1

Without eevdf enabled, the /proc/schedstat delta within 5 seconds on CPU8 is:
Thu Aug 10 11:02:02 2023 cpu8
.stats.check_preempt_count 51973 <-----
.stats.need_preempt_count 10514 <-----
.stats.rq_cpu_time 5004068598
.stats.rq_sched_info.pcount 60374
.stats.rq_sched_info.run_delay 80405664582
.stats.sched_count 60609
.stats.sched_goidle 227
.stats.ttwu_count 56250
.stats.ttwu_local 14619

The preemption success ratio is 10514 / 51973 = 20.23%
-----------------------------------------------------------------------------

With eevdf enabled, the /proc/schedstat delta within 5 seconds on CPU8 is:
Thu Aug 10 10:22:55 2023 cpu8
.stats.check_preempt_count 71673 <----
.stats.low_gran_preempt_count 57410
.stats.need_preempt_count 57413 <----
.stats.rq_cpu_time 5007778990
.stats.rq_sched_info.pcount 129233
.stats.rq_sched_info.run_delay 164830921362
.stats.sched_count 129233
.stats.ttwu_count 70222
.stats.ttwu_local 66847

The preemption success ratio is 57413 / 71673 = 80.10%
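
For reference, the two ratios can be recomputed from the schedstat deltas
with a small helper like the one below (preempt_success_pct is a
hypothetical name used only for this illustration):

```c
#include <assert.h>

/*
 * Percentage of check_preempt_wakeup() calls that reached the preempt:
 * label, i.e. need_preempt_count / check_preempt_count * 100.
 */
static double preempt_success_pct(unsigned int need_preempt,
				  unsigned int check_preempt)
{
	assert(check_preempt > 0);
	return 100.0 * need_preempt / check_preempt;
}
```

Calling it with (10514, 51973) and (57413, 71673) gives the 20.23% and 80.10%
figures quoted above; with (57410, 57413) it shows that nearly all eevdf
preemptions on this CPU were counted by low_gran_preempt_count.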

According to low_gran_preempt_count, most successful preemptions happen
when current->vruntime is no larger than wakee->vruntime + sysctl_sched_wakeup_granularity,
which would not trigger a preemption in the current cfs wakeup_preempt_entity().

It seems that eevdf does not inhibit wakeup preemption as much as cfs does;
maybe this is because eevdf needs to consider fairness more?
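
To make the comparison concrete, here is a simplified model of the cfs check
next to the condition counted by low_gran_preempt_count. This is only a
sketch: the toy_* names are hypothetical, and the real
wakeup_preempt_entity() scales the granularity through wakeup_gran(), which
is omitted here.

```c
#include <stdint.h>

/*
 * Simplified cfs wakeup check: returns 1 (preempt) only when the current
 * task is ahead of the wakee by more than the granularity, 0 when it is
 * ahead by less, and -1 when it is not ahead at all.
 */
static int toy_cfs_should_preempt(int64_t curr_vruntime,
				  int64_t wakee_vruntime, int64_t gran)
{
	int64_t vdiff = curr_vruntime - wakee_vruntime;

	if (vdiff <= 0)
		return -1;
	if (vdiff > gran)
		return 1;
	return 0;
}

/* Condition counted by low_gran_preempt_count in the debug patch. */
static int toy_is_low_gran(int64_t curr_vruntime,
			   int64_t wakee_vruntime, int64_t gran)
{
	return curr_vruntime <= wakee_vruntime + gran;
}
```

In this model, whenever toy_is_low_gran() is true, toy_cfs_should_preempt()
returns 0 or -1, so every preemption counted as low_gran under eevdf is one
that the cfs check would have rejected.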

thanks,
Chenyu

2023-08-14 14:07:56

by Peter Zijlstra

Subject: Re: [tip:sched/eevdf] [sched/fair] e0c2ff903c: phoronix-test-suite.blogbench.Write.final_score -34.8% regression

On Fri, Aug 11, 2023 at 09:11:21AM +0800, Chen Yu wrote:
> On 2023-08-10 at 21:24:37 +0800, kernel test robot wrote:
> >
> >
> > Hello,
> >
> > kernel test robot noticed a -34.8% regression of phoronix-test-suite.blogbench.Write.final_score on:
> >
> >
> > commit: e0c2ff903c320d3fd3c2c604dc401b3b7c0a1d13 ("sched/fair: Remove sched_feat(START_DEBIT)")
> > https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git sched/eevdf
> >
> > testcase: phoronix-test-suite
> > test machine: 96 threads 2 sockets Intel(R) Xeon(R) Gold 6252 CPU @ 2.10GHz (Cascade Lake) with 512G memory
> > parameters:
> >
> > test: blogbench-1.1.0
> > option_a: Write
> > cpufreq_governor: performance
> >

Is this benchmark fork() heavy?

> It seems that commit e0c2ff903c32 removed sched_feat(START_DEBIT) for the initial
> task, but it also increases the vruntime of non-initial tasks:
> Before e0c2ff903c32, the vruntime of an enqueued task is:
> cfs_rq->min_vruntime
> After e0c2ff903c32, the vruntime of an enqueued task is:
> avg_vruntime(cfs_rq) = \Sum v_i * w_i / W
> = \Sum v_i / nr_tasks (when all weights are equal)
> which is usually higher than cfs_rq->min_vruntime, so we give less sleep bonus to
> the wakee, which could affect different workloads to varying degrees.
> But since we later switched to lag-based placement, the lag is subtracted from
> this new vruntime, which could mitigate the problem.

Right.. but given this problem was bisected through the lag based
placement to this commit, I wondered about fork() / pthread_create().

If this is indeed fork()/pthread_create() heavy, could you please see if
disabling PLACE_DEADLINE_INITIAL helps?

2023-08-14 14:12:08

by Peter Zijlstra

Subject: Re: [tip:sched/eevdf] [sched/fair] e0c2ff903c: phoronix-test-suite.blogbench.Write.final_score -34.8% regression

On Fri, Aug 11, 2023 at 10:42:09AM +0800, Chen Yu wrote:

> Since lkp previously reported a hackbench regression with the eevdf policy
> enabled, I did some experiments and found that with eevdf enabled there are
> more preemptions, and these preemptions could slow down the waker (each waker
> wakes up 20 wakees in hackbench). The reason might be that
> check_preempt_wakeup() preempts the current task more readily in eevdf:

This is true.

> Without eevdf enabled, the /proc/schedstat delta within 5 seconds on CPU8 is:
> Thu Aug 10 11:02:02 2023 cpu8
> .stats.check_preempt_count 51973 <-----
> .stats.need_preempt_count 10514 <-----
> .stats.rq_cpu_time 5004068598
> .stats.rq_sched_info.pcount 60374
> .stats.rq_sched_info.run_delay 80405664582
> .stats.sched_count 60609
> .stats.sched_goidle 227
> .stats.ttwu_count 56250
> .stats.ttwu_local 14619
>
> The preemption success ratio is 10514 / 51973 = 20.23%
> -----------------------------------------------------------------------------
>
> With eevdf enabled, the /proc/schedstat delta within 5 seconds on CPU8 is:
> Thu Aug 10 10:22:55 2023 cpu8
> .stats.check_preempt_count 71673 <----
> .stats.low_gran_preempt_count 57410
> .stats.need_preempt_count 57413 <----
> .stats.rq_cpu_time 5007778990
> .stats.rq_sched_info.pcount 129233
> .stats.rq_sched_info.run_delay 164830921362
> .stats.sched_count 129233
> .stats.ttwu_count 70222
> .stats.ttwu_local 66847
>
> The preemption success ratio is 57413 / 71673 = 80.10%

note: wakeup-preemption

> According to low_gran_preempt_count, most successful preemptions happen
> when current->vruntime is no larger than wakee->vruntime + sysctl_sched_wakeup_granularity,
> which would not trigger a preemption in the current cfs wakeup_preempt_entity().
>
> It seems that eevdf does not inhibit wakeup preemption as much as cfs does;
> maybe this is because eevdf needs to consider fairness more?

Not fairness, latency. Because it wants to honour the virtual deadline.


Are these wakeup preemptions typically on runqueues that have only a
single other task?

That is, consider a single task running, then avg_vruntime will be its
vruntime, because the average of one variable must be the value of that
one variable.

Then the moment a second task joins, we get two options:

- positive lag
- negative lag

When the new task has negative lag, it gets placed to the right of the
currently running task (and avg_vruntime has a forward discontinuity).
At this point the new task is not eligible and does not get to run.

When the new task has positive lag, it gets placed to the left of the
currently running task (and avg_vruntime has a backward discontinuity).
At this point the currently running task is no longer eligible, and the
new task must be selected -- irrespective of its deadline.
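
A toy model of the two cases above, assuming an entity is eligible when its
vruntime is not past the weighted average and that the new task is placed at
(old average - lag). All names here are hypothetical and this is only a
sketch; the kernel's real placement and eligibility code is more involved.

```c
#include <assert.h>
#include <stdint.h>

struct toy_rq_state {
	int64_t avg_vruntime;	/* weighted average after the new task joins */
	int curr_eligible;	/* current task: vruntime <= average? */
	int new_eligible;	/* new task: vruntime <= average? */
};

/*
 * One task is running, so the average before the wakeup equals curr_v.
 * The new task is placed at curr_v - lag: positive lag puts it to the
 * left of the current task, negative lag to the right.
 * Integer division is fine for the non-negative values used here.
 */
static struct toy_rq_state toy_place_second_task(int64_t curr_v, int64_t curr_w,
						 int64_t lag, int64_t new_w)
{
	struct toy_rq_state s;
	int64_t new_v = curr_v - lag;

	assert(curr_w > 0 && new_w > 0);
	s.avg_vruntime = (curr_v * curr_w + new_v * new_w) / (curr_w + new_w);
	s.curr_eligible = curr_v <= s.avg_vruntime;
	s.new_eligible = new_v <= s.avg_vruntime;
	return s;
}
```

With curr_v = 1000, equal weights and lag = +200, the average drops to 900,
so the current task is no longer eligible and only the new task is. With
lag = -200 the average rises to 1100 and only the current task is eligible.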

The paper doesn't (AFAIR) consider the case of wake-up-preemption
explicitly. It only considers task selection and vruntime placement.

One option I suppose would be to gate the wakeup preemption by virtual
deadline, only allow when the new task has an earlier deadline than the
currently running one, and otherwise rely on tick preemption.

NOTE: poking at wakeup preemption is a double-edged sword: some
workloads love it, some hate it. Touching it is bound to upset the
balance -- again.

(also, did I get that the right way around? -- I've got a Monday brain
that isn't willing to boot properly)

---
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fe5be91c71c7..16d24e5dda8f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8047,6 +8047,15 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
 	cfs_rq = cfs_rq_of(se);
 	update_curr(cfs_rq);

+	if (sched_feat(WAKEUP_DEADLINE)) {
+		/*
+		 * Only allow preemption if the virtual deadline of the new
+		 * task is before the virtual deadline of the existing task.
+		 */
+		if (deadline_gt(deadline, pse, se))
+			return;
+	}
+
 	/*
 	 * XXX pick_eevdf(cfs_rq) != se ?
 	 */
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 61bcbf5e46a4..e733981b32aa 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -24,6 +24,7 @@ SCHED_FEAT(CACHE_HOT_BUDDY, true)
  * Allow wakeup-time preemption of the current task:
  */
 SCHED_FEAT(WAKEUP_PREEMPTION, true)
+SCHED_FEAT(WAKEUP_DEADLINE, true)

 SCHED_FEAT(HRTICK, false)
 SCHED_FEAT(HRTICK_DL, false)