LinuxLists.cc - [tip:sched/eevdf] [sched/fair] e0c2ff903c: phoronix-test-suite.blogbench.Write.final

2023-08-10 15:09:04

Subject: [tip:sched/eevdf] [sched/fair] e0c2ff903c: phoronix-test-suite.blogbench.Write.final_score -34.8% regression

Hello,

kernel test robot noticed a -34.8% regression of phoronix-test-suite.blogbench.Write.final_score on:


commit: e0c2ff903c320d3fd3c2c604dc401b3b7c0a1d13 ("sched/fair: Remove sched_feat(START_DEBIT)")
https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git sched/eevdf

testcase: phoronix-test-suite
test machine: 96 threads 2 sockets Intel(R) Xeon(R) Gold 6252 CPU @ 2.10GHz (Cascade Lake) with 512G memory
parameters:

test: blogbench-1.1.0
option_a: Write
cpufreq_governor: performance


[...] "[...] e0c2ff903c: pft.faults_per_sec_per_cpu 7.0% improvement"
[...] https://lore.kernel.org/all/202308091624.d97ae058-oliver.sang@intel.com/
[...] a regression, so report again FYI


If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <[email protected]>
| Closes: https://lore.kernel.org/oe-lkp/202308101628.7af4631a-oliver.sang@intel.com

----------------------------------------------------------------------->

The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20230810/202308101628.7af4631a-oliver.sang@intel.com

=========================================================================================
compiler/cpufreq_governor/kconfig/option_a/rootfs/tbox_group/test/testcase:
gcc-12/performance/x86_64-rhel-8.3/Write/debian-x86_64-phoronix/lkp-csl-2sp7/blogbench-1.1.0/phoronix-test-suite

commit:
af4cf40470 ("sched/fair: Add cfs_rq::avg_vruntime")
e0c2ff903c ("sched/fair: Remove sched_feat(START_DEBIT)")

af4cf40470c22efa e0c2ff903c320d3fd3c2c604dc4
---------------- ---------------------------
         %stddev     %change         %stddev
             \          |                \
     13.43 ±  6%      +1.7       15.15 ±  5%  mpstat.cpu.all.idle%
 4.516e+09 ±  7%     +13.6%  5.129e+09 ±  6%  cpuidle..time
   5386162 ±  6%     +15.1%    6199901 ±  5%  cpuidle..usage
   5829408 ±  7%      -8.7%    5320418 ±  4%  numa-numastat.node0.local_node
   5839930 ±  7%      -8.7%    5333078 ±  4%  numa-numastat.node0.numa_hit
   6025065 ±  5%     -12.9%    5245696 ±  7%  numa-numastat.node1.local_node
     57086           -28.2%      40989        vmstat.io.bo
  11120354           -21.6%    8721798        vmstat.memory.cache
     18750           +12.1%      21014        vmstat.system.cs
    215070 ± 25%     +29.4%     278379 ± 23%  numa-meminfo.node1.AnonPages.max
   5507703 ± 13%     -31.6%    3766183 ± 23%  numa-meminfo.node1.FilePages
   2118171 ± 13%     -26.7%    1551936 ± 25%  numa-meminfo.node1.Inactive
   6824965 ± 12%     -26.7%    5005740 ± 18%  numa-meminfo.node1.MemUsed
      4960           -34.8%       3235 ±  2%  phoronix-test-suite.blogbench.Write.final_score
  35120853           -29.4%   24804986        phoronix-test-suite.time.file_system_outputs
      8058            -1.9%       7908        phoronix-test-suite.time.percent_of_cpu_this_job_got
     26517            -1.2%      26196        phoronix-test-suite.time.system_time
   1445969           +10.6%    1599011        phoronix-test-suite.time.voluntary_context_switches
    600079 ±  3%     +26.4%     758665 ±  3%  turbostat.C1E
      0.75 ±  3%      +0.1        0.84 ±  4%  turbostat.C1E%
   4372028 ±  7%     +14.1%    4987180 ±  7%  turbostat.C6
     13.09 ±  6%      +1.6       14.72 ±  6%  turbostat.C6%
     11.39 ±  5%     +10.4%      12.57 ±  4%  turbostat.CPU%c1
   2330913 ± 13%     -23.5%    1782895 ± 14%  numa-vmstat.node0.nr_dirtied
   2256547 ± 13%     -23.8%    1719197 ± 14%  numa-vmstat.node0.nr_written
   5840154 ±  7%      -8.7%    5333324 ±  4%  numa-vmstat.node0.numa_hit
   5829632 ±  7%      -8.7%    5320664 ±  4%  numa-vmstat.node0.numa_local
   2287685 ± 11%     -33.8%    1514622 ± 20%  numa-vmstat.node1.nr_dirtied
   1376658 ± 13%     -31.6%     941500 ± 23%  numa-vmstat.node1.nr_file_pages
    127.17 ± 18%     -52.4%      60.50 ± 31%  numa-vmstat.node1.nr_writeback
   2215841 ± 12%     -34.1%    1460272 ± 20%  numa-vmstat.node1.nr_written
   6025232 ±  5%     -12.9%    5245703 ±  7%  numa-vmstat.node1.numa_local
     74.24 ±  8%     -38.1%      45.97 ± 11%  sched_debug.cfs_rq:/.load_avg.avg
      3167 ± 14%     -78.6%     676.56 ± 45%  sched_debug.cfs_rq:/.load_avg.max
    339.56 ± 15%     -74.2%      87.77 ± 38%  sched_debug.cfs_rq:/.load_avg.stddev
      1829 ±  6%     -12.7%       1597 ±  9%  sched_debug.cfs_rq:/.runnable_avg.max
      1490 ±  7%     -13.5%       1289 ±  7%  sched_debug.cfs_rq:/.util_avg.max
      1495 ±  7%     -13.3%       1296 ±  8%  sched_debug.cfs_rq:/.util_est_enqueued.max
     33772           +15.4%      38971        sched_debug.cpu.nr_switches.avg
    117253 ± 10%    +106.6%     242187 ± 13%  sched_debug.cpu.nr_switches.max
     11889 ±  9%     +96.8%      23399 ± 12%  sched_debug.cpu.nr_switches.stddev
  58611259           -50.0%   29305629        sched_debug.sysctl_sched.sysctl_sched_features
   6885022 ±  2%     -25.3%    5139844 ±  2%  meminfo.Active
   6735239 ±  2%     -25.9%    4988483 ±  2%  meminfo.Active(file)
  10388292           -21.3%    8173968        meminfo.Cached
   3953695           -11.9%    3483739        meminfo.Inactive
   2793179           -16.8%    2323187        meminfo.Inactive(file)
    716630           -21.7%     561030        meminfo.KReclaimable
  12938216           -18.5%   10538792        meminfo.Memused
    716630           -21.7%     561030        meminfo.SReclaimable
    462556           -10.4%     414369        meminfo.SUnreclaim
   1179187           -17.3%     975399        meminfo.Slab
  13543613           -19.3%   10926031        meminfo.max_used_kB
   1682930 ±  2%     -25.9%    1246741 ±  2%  proc-vmstat.nr_active_file
   4618598           -28.6%    3297498 ±  2%  proc-vmstat.nr_dirtied
    115523 ±  2%      -4.3%     110605        proc-vmstat.nr_dirty
   2596313           -21.3%    2043401        proc-vmstat.nr_file_pages
    698013           -16.8%     580677        proc-vmstat.nr_inactive_file
    179075           -21.7%     140220        proc-vmstat.nr_slab_reclaimable
    115576           -10.4%     103600        proc-vmstat.nr_slab_unreclaimable
    200.00 ± 12%     -60.2%      79.67 ± 27%  proc-vmstat.nr_writeback
   4472388           -28.9%    3179453 ±  2%  proc-vmstat.nr_written
   1682930 ±  2%     -25.9%    1246741 ±  2%  proc-vmstat.nr_zone_active_file
    698013           -16.8%     580677        proc-vmstat.nr_zone_inactive_file
    115652 ±  2%      -4.4%     110571        proc-vmstat.nr_zone_write_pending
  11888053           -10.8%   10598323 ±  2%  proc-vmstat.numa_hit
  11861225           -10.9%   10573027 ±  2%  proc-vmstat.numa_local
   3434596           -28.4%    2460772        proc-vmstat.pgactivate
  18752650            -8.2%   17219808 ±  3%  proc-vmstat.pgalloc_normal
  18252683            -8.4%   16727157 ±  3%  proc-vmstat.pgfree
  19379562           -27.5%   14053587 ±  2%  proc-vmstat.pgpgout
    205988            +3.5%     213118        proc-vmstat.pgreuse
     97.12            -1.1       96.06        perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe
     97.08            -1.1       96.02        perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe
     52.33            -1.0       51.36        perf-profile.calltrace.cycles-pp.__x64_sys_openat.do_syscall_64.entry_SYSCALL_64_after_hwframe
     52.32            -1.0       51.35        perf-profile.calltrace.cycles-pp.do_sys_openat2.__x64_sys_openat.do_syscall_64.entry_SYSCALL_64_after_hwframe
     97.12            -1.1       96.06        perf-profile.children.cycles-pp.entry_SYSCALL_64_after_hwframe
     97.08            -1.1       96.02        perf-profile.children.cycles-pp.do_syscall_64
     85.21            -1.0       84.18        perf-profile.children.cycles-pp.native_queued_spin_lock_slowpath
     85.58            -1.0       84.56        perf-profile.children.cycles-pp._raw_spin_lock
     52.33            -1.0       51.36        perf-profile.children.cycles-pp.__x64_sys_openat
     52.32            -1.0       51.35        perf-profile.children.cycles-pp.do_sys_openat2
      0.64 ±  2%      +0.0        0.68 ±  2%  perf-profile.children.cycles-pp.asm_sysvec_apic_timer_interrupt
      0.28 ±  5%      +0.0        0.32 ±  5%  perf-profile.children.cycles-pp.__irq_exit_rcu
      0.02 ±141%      +0.1        0.07 ± 15%  perf-profile.children.cycles-pp.__x64_sys_rename
      0.08 ± 29%      +0.1        0.14 ± 19%  perf-profile.children.cycles-pp.process_one_work
      0.01 ±223%      +0.1        0.07 ± 15%  perf-profile.children.cycles-pp.do_renameat2
      0.08 ± 29%      +0.1        0.15 ± 20%  perf-profile.children.cycles-pp.worker_thread
      0.03 ±101%      +0.1        0.10 ± 21%  perf-profile.children.cycles-pp.__extent_writepage
      0.03 ±103%      +0.1        0.11 ± 20%  perf-profile.children.cycles-pp.do_writepages
      0.03 ±103%      +0.1        0.11 ± 20%  perf-profile.children.cycles-pp.extent_writepages
      0.03 ±103%      +0.1        0.11 ± 20%  perf-profile.children.cycles-pp.extent_write_cache_pages
     84.73            -1.0       83.69        perf-profile.self.cycles-pp.native_queued_spin_lock_slowpath
 5.487e+09            -2.2%  5.364e+09        perf-stat.i.branch-instructions
      0.73 ±  8%      +0.2        0.92 ± 12%  perf-stat.i.branch-miss-rate%
     18867           +12.0%      21136        perf-stat.i.context-switches
 2.307e+11            -2.0%  2.261e+11        perf-stat.i.cpu-cycles
      1352 ±  2%      +3.5%       1399        perf-stat.i.cpu-migrations
   4638095 ± 27%     -38.2%    2865514 ± 23%  perf-stat.i.dTLB-load-misses
 6.107e+09            -2.3%  5.965e+09        perf-stat.i.dTLB-loads
 2.391e+10            -2.3%  2.335e+10        perf-stat.i.instructions
      0.17            -7.8%       0.16 ±  2%  perf-stat.i.ipc
      0.62 ±  2%     +14.5%       0.71 ±  6%  perf-stat.i.major-faults
      2.40            -2.0%       2.35        perf-stat.i.metric.GHz
    139.98            -2.3%     136.76        perf-stat.i.metric.M/sec
     66.90            -1.8       65.11        perf-stat.i.node-load-miss-rate%
  65029001            -5.5%   61464545 ±  3%  perf-stat.i.node-load-misses
      0.08 ± 27%      -0.0        0.05 ± 24%  perf-stat.overall.dTLB-load-miss-rate%
     67.28            -2.3       64.93        perf-stat.overall.node-load-miss-rate%
 5.474e+09            -2.2%  5.352e+09        perf-stat.ps.branch-instructions
     18821           +12.1%      21105        perf-stat.ps.context-switches
 2.301e+11            -2.0%  2.256e+11        perf-stat.ps.cpu-cycles
      1349 ±  2%      +3.6%       1398        perf-stat.ps.cpu-migrations
   4630176 ± 27%     -38.2%    2859589 ± 23%  perf-stat.ps.dTLB-load-misses
 6.092e+09            -2.3%  5.952e+09        perf-stat.ps.dTLB-loads
 2.385e+10            -2.3%   2.33e+10        perf-stat.ps.instructions
      0.61 ±  2%     +14.6%       0.70 ±  6%  perf-stat.ps.major-faults
  64859061            -5.5%   61314167 ±  3%  perf-stat.ps.node-load-misses
 8.031e+12            -1.4%  7.921e+12        perf-stat.total.instructions


Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

2023-08-11 01:50:14

by Chen Yu

Subject: Re: [tip:sched/eevdf] [sched/fair] e0c2ff903c: phoronix-test-suite.blogbench.Write.final_score -34.8% regression

On 2023-08-10 at 21:24:37 +0800, kernel test robot wrote:
>
>
> Hello,
>
> kernel test robot noticed a -34.8% regression of phoronix-test-suite.blogbench.Write.final_score on:
>
>
> commit: e0c2ff903c320d3fd3c2c604dc401b3b7c0a1d13 ("sched/fair: Remove sched_feat(START_DEBIT)")
> https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git sched/eevdf
>
> testcase: phoronix-test-suite
> test machine: 96 threads 2 sockets Intel(R) Xeon(R) Gold 6252 CPU @ 2.10GHz (Cascade Lake) with 512G memory
> parameters:
>
> test: blogbench-1.1.0
> option_a: Write
> cpufreq_governor: performance
>

It seems that commit e0c2ff903c32 removed sched_feat(START_DEBIT) for the initial
task, but it also increases the vruntime for non-initial tasks:
Before e0c2ff903c32, the vruntime for an enqueued task is:
    cfs_rq->min_vruntime
After e0c2ff903c32, the vruntime for an enqueued task is:
    avg_vruntime(cfs_rq) = \Sum v_i * w_i / W
                         = \Sum v_i / nr_tasks    (when all tasks have equal weight)
which is usually higher than cfs_rq->min_vruntime, so the wakee gets less sleep
bonus, which could affect different workloads to a greater or lesser degree.
But since we later switched to lag-based placement, the lag is subtracted from
this new vruntime, which could mitigate the problem.
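
As a rough illustration of the two placements compared above, here is a minimal userspace sketch. It is not kernel code: the `Entity` class, the sample vruntimes and the use of a plain minimum as a stand-in for cfs_rq->min_vruntime are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Entity:
    vruntime: int  # virtual runtime, arbitrary units (hypothetical values)
    weight: int    # load weight; 1024 corresponds to nice 0 in the kernel


def min_vruntime(queue):
    """Smallest vruntime on the queue (simplified stand-in for cfs_rq->min_vruntime)."""
    return min(e.vruntime for e in queue)


def avg_vruntime(queue):
    """Weighted average vruntime: sum(v_i * w_i) / W."""
    total_weight = sum(e.weight for e in queue)
    return sum(e.vruntime * e.weight for e in queue) / total_weight


# Three equal-weight tasks, so the weighted average reduces to sum(v_i) / nr_tasks.
queue = [Entity(1000, 1024), Entity(1600, 1024), Entity(2200, 1024)]
old_placement = min_vruntime(queue)
new_placement = avg_vruntime(queue)
print(old_placement, new_placement)
```

With these sample values the average sits 600 units to the right of the minimum, which is the "less sleep bonus" effect described above.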

thanks,
Chenyu

2023-08-11 03:48:48

by Chen Yu

Subject: Re: [tip:sched/eevdf] [sched/fair] e0c2ff903c: phoronix-test-suite.blogbench.Write.final_score -34.8% regression

Hi Peter,

On 2023-08-11 at 09:11:21 +0800, Chen Yu wrote:
> On 2023-08-10 at 21:24:37 +0800, kernel test robot wrote:
> >
> >
> > Hello,
> >
> > kernel test robot noticed a -34.8% regression of phoronix-test-suite.blogbench.Write.final_score on:
> >
> >
> > commit: e0c2ff903c320d3fd3c2c604dc401b3b7c0a1d13 ("sched/fair: Remove sched_feat(START_DEBIT)")
> > https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git sched/eevdf
> >
> > testcase: phoronix-test-suite
> > test machine: 96 threads 2 sockets Intel(R) Xeon(R) Gold 6252 CPU @ 2.10GHz (Cascade Lake) with 512G memory
> > parameters:
> >
> > test: blogbench-1.1.0
> > option_a: Write
> > cpufreq_governor: performance
> >
>
> It seems that commit e0c2ff903c32 removed sched_feat(START_DEBIT) for the initial
> task, but it also increases the vruntime for non-initial tasks:
> Before e0c2ff903c32, the vruntime for an enqueued task is:
>     cfs_rq->min_vruntime
> After e0c2ff903c32, the vruntime for an enqueued task is:
>     avg_vruntime(cfs_rq) = \Sum v_i * w_i / W
>                          = \Sum v_i / nr_tasks    (when all tasks have equal weight)
> which is usually higher than cfs_rq->min_vruntime, so the wakee gets less sleep
> bonus, which could affect different workloads to a greater or lesser degree.
> But since we later switched to lag-based placement, the lag is subtracted from
> this new vruntime, which could mitigate the problem.
>
>

Since lkp previously reported a hackbench regression with the eevdf policy
enabled, I did some experiments and found that with eevdf enabled there are
more preemptions, and these preemptions could slow down the waker (each waker
wakes up 20 wakees in hackbench). The reason might be that
check_preempt_wakeup() preempts the current task more easily under eevdf:

baseline(no eevdf): sched/fair: Add cfs_rq::avg_vruntime
compare(eevdf): sched/smp: Use lag to simplify cross-runqueue placement

hackbench -g 112 -f 20 --threads -l 30000 -s 100
on a system with 1 socket 56C/112T online

1. Use the following bpf script to track the preemption count
within 10 seconds:

tracepoint:sched:sched_switch
{
	if (args->prev_state == TASK_RUNNING) {
		@preempt = count();
	}
}

baseline:
bpftrace preemption.bt
Attaching 4 probes...
10:54:45 Preemption count:
@preempt: 2409638

compare:
bpftrace preemption.bt
Attaching 4 probes...
10:02:21 Preemption count:
@preempt: 12147709

There are many more preemptions with eevdf enabled (about 5x).

2. Add the following schedstats to track the ratio of successful preemption
in check_preempt_wakeup():

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 57e8bc14b06e..dfd4a6ebdf23 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8294,6 +8294,8 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
 	int next_buddy_marked = 0;
 	int cse_is_idle, pse_is_idle;

+	schedstat_inc(rq->check_preempt_count);
+
 	if (unlikely(se == pse))
 		return;

@@ -8358,8 +8360,12 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
 		/*
 		 * XXX pick_eevdf(cfs_rq) != se ?
 		 */
-		if (pick_eevdf(cfs_rq) == pse)
+		if (pick_eevdf(cfs_rq) == pse) {
+			if (se->vruntime <= pse->vruntime + sysctl_sched_wakeup_granularity)
+				schedstat_inc(rq->low_gran_preempt_count);
+
 			goto preempt;
+		}

 		return;
 	}
@@ -8377,6 +8383,8 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
 	return;

 preempt:
+	schedstat_inc(rq->need_preempt_count);
+
 	resched_curr(rq);
 	/*
 	 * Only set the backward buddy when the current task is still
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index aa5b293ca4ed..58abd3d53f1d 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1128,6 +1128,9 @@ struct rq {
 	/* try_to_wake_up() stats */
 	unsigned int ttwu_count;
 	unsigned int ttwu_local;
+	unsigned int check_preempt_count;
+	unsigned int need_preempt_count;
+	unsigned int low_gran_preempt_count;
 #endif

 #ifdef CONFIG_CPU_IDLE
diff --git a/kernel/sched/stats.c b/kernel/sched/stats.c
index 857f837f52cb..99392cad0c07 100644
--- a/kernel/sched/stats.c
+++ b/kernel/sched/stats.c
@@ -133,12 +133,14 @@ static int show_schedstat(struct seq_file *seq, void *v)

 		/* runqueue-specific stats */
 		seq_printf(seq,
-		    "cpu%d %u 0 %u %u %u %u %llu %llu %lu",
+		    "cpu%d %u 0 %u %u %u %u %llu %llu %lu %u %u %u",
 		    cpu, rq->yld_count,
 		    rq->sched_count, rq->sched_goidle,
 		    rq->ttwu_count, rq->ttwu_local,
 		    rq->rq_cpu_time,
-		    rq->rq_sched_info.run_delay, rq->rq_sched_info.pcount);
+		    rq->rq_sched_info.run_delay, rq->rq_sched_info.pcount,
+		    rq->check_preempt_count, rq->need_preempt_count,
+		    rq->low_gran_preempt_count);

 		seq_printf(seq, "\n");

--
2.25.1

Without eevdf enabled, the /proc/schedstat delta within 5 seconds on CPU8 is:
Thu Aug 10 11:02:02 2023 cpu8
.stats.check_preempt_count 51973 <-----
.stats.need_preempt_count 10514 <-----
.stats.rq_cpu_time 5004068598
.stats.rq_sched_info.pcount 60374
.stats.rq_sched_info.run_delay 80405664582
.stats.sched_count 60609
.stats.sched_goidle 227
.stats.ttwu_count 56250
.stats.ttwu_local 14619

The preemption success ratio is 10514 / 51973 = 20.23%
-----------------------------------------------------------------------------

With eevdf enabled, the /proc/schedstat delta within 5 seconds on CPU8 is:
Thu Aug 10 10:22:55 2023 cpu8
.stats.check_preempt_count 71673 <----
.stats.low_gran_preempt_count 57410
.stats.need_preempt_count 57413 <----
.stats.rq_cpu_time 5007778990
.stats.rq_sched_info.pcount 129233
.stats.rq_sched_info.run_delay 164830921362
.stats.sched_count 129233
.stats.ttwu_count 70222
.stats.ttwu_local 66847

The preemption success ratio is 57413 / 71673 = 80.10%
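
For reference, the ratios can be recomputed from the counters quoted above with a few lines of Python. This is only a sketch of the arithmetic; the helper name `success_ratio` is made up for the example, and the inputs are the CPU8 deltas and bpftrace counts shown in this message.

```python
def success_ratio(need_preempt, check_preempt):
    """Percentage of check_preempt_wakeup() calls that ended in a preemption."""
    return 100.0 * need_preempt / check_preempt


# /proc/schedstat deltas for cpu8 over 5 seconds, as quoted above.
baseline_ratio = success_ratio(10514, 51973)  # without eevdf
eevdf_ratio = success_ratio(57413, 71673)     # with eevdf

# Share of eevdf preemptions that fell inside the wakeup granularity.
low_gran_share = 100.0 * 57410 / 57413

# Growth in sched_switch preemptions counted by the bpftrace script.
preempt_growth = 12147709 / 2409638

print(f"baseline {baseline_ratio:.2f}%  eevdf {eevdf_ratio:.2f}%")
print(f"low-gran share {low_gran_share:.2f}%  preemption growth {preempt_growth:.2f}x")
```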

According to low_gran_preempt_count, most successful preemptions happen when
current->vruntime is no greater than wakee->vruntime + sysctl_sched_wakeup_granularity,
which would not happen in the current cfs wakeup_preempt_entity().
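
To make the comparison concrete, the sketch below contrasts a simplified version of the cfs granularity rule with the condition counted by low_gran_preempt_count in the debug patch. It is an assumption-laden model, not the kernel code: the weight scaling done by the real wakeup_gran() is ignored, and the granularity value is a hypothetical number.

```python
GRAN = 1_000_000  # hypothetical wakeup granularity in ns (assumption for the example)


def cfs_would_preempt(curr_vruntime, wakee_vruntime, gran=GRAN):
    """Simplified cfs rule: preempt only if curr leads the wakee by more than gran."""
    return curr_vruntime - wakee_vruntime > gran


def counted_as_low_gran(curr_vruntime, wakee_vruntime, gran=GRAN):
    """Condition the debug patch counts in low_gran_preempt_count."""
    return curr_vruntime <= wakee_vruntime + gran


# (curr, wakee) vruntime pairs: wakee far behind, slightly behind, and ahead.
cases = [(5_000_000, 1_000_000), (1_500_000, 1_000_000), (900_000, 1_000_000)]
results = [(cfs_would_preempt(c, w), counted_as_low_gran(c, w)) for c, w in cases]
print(results)
```

Under this simplification the two conditions are exact complements, so every preemption counted as "low gran" under eevdf is one that the cfs rule would have suppressed.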

It seems that eevdf does not inhibit wakeup preemption as much as cfs does;
maybe that is because eevdf needs to consider fairness more?

thanks,
Chenyu

2023-08-14 14:07:56

by Peter Zijlstra

Subject: Re: [tip:sched/eevdf] [sched/fair] e0c2ff903c: phoronix-test-suite.blogbench.Write.final_score -34.8% regression

On Fri, Aug 11, 2023 at 09:11:21AM +0800, Chen Yu wrote:
> On 2023-08-10 at 21:24:37 +0800, kernel test robot wrote:
> >
> >
> > Hello,
> >
> > kernel test robot noticed a -34.8% regression of phoronix-test-suite.blogbench.Write.final_score on:
> >
> >
> > commit: e0c2ff903c320d3fd3c2c604dc401b3b7c0a1d13 ("sched/fair: Remove sched_feat(START_DEBIT)")
> > https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git sched/eevdf
> >
> > testcase: phoronix-test-suite
> > test machine: 96 threads 2 sockets Intel(R) Xeon(R) Gold 6252 CPU @ 2.10GHz (Cascade Lake) with 512G memory
> > parameters:
> >
> > test: blogbench-1.1.0
> > option_a: Write
> > cpufreq_governor: performance
> >

Is this benchmark fork() heavy?

> It seems that commit e0c2ff903c32 removed sched_feat(START_DEBIT) for the initial
> task, but it also increases the vruntime for non-initial tasks:
> Before e0c2ff903c32, the vruntime for an enqueued task is:
>     cfs_rq->min_vruntime
> After e0c2ff903c32, the vruntime for an enqueued task is:
>     avg_vruntime(cfs_rq) = \Sum v_i * w_i / W
>                          = \Sum v_i / nr_tasks    (when all tasks have equal weight)
> which is usually higher than cfs_rq->min_vruntime, so the wakee gets less sleep
> bonus, which could affect different workloads to a greater or lesser degree.
> But since we later switched to lag-based placement, the lag is subtracted from
> this new vruntime, which could mitigate the problem.

Right.. but given this problem was bisected through the lag based
placement to this commit, I wondered about fork() / pthread_create().

If this is indeed fork()/pthread_create() heavy, could you please see if
disabling PLACE_DEADLINE_INITIAL helps?

2023-08-14 14:12:08

by Peter Zijlstra

Subject: Re: [tip:sched/eevdf] [sched/fair] e0c2ff903c: phoronix-test-suite.blogbench.Write.final_score -34.8% regression

On Fri, Aug 11, 2023 at 10:42:09AM +0800, Chen Yu wrote:

> Since lkp previously reported a hackbench regression with the eevdf policy
> enabled, I did some experiments and found that with eevdf enabled there are
> more preemptions, and these preemptions could slow down the waker (each waker
> wakes up 20 wakees in hackbench). The reason might be that
> check_preempt_wakeup() preempts the current task more easily under eevdf:

This is true.

> Without eevdf enabled, the /proc/schedstat delta within 5 seconds on CPU8 is:
> Thu Aug 10 11:02:02 2023 cpu8
> .stats.check_preempt_count 51973 <-----
> .stats.need_preempt_count 10514 <-----
> .stats.rq_cpu_time 5004068598
> .stats.rq_sched_info.pcount 60374
> .stats.rq_sched_info.run_delay 80405664582
> .stats.sched_count 60609
> .stats.sched_goidle 227
> .stats.ttwu_count 56250
> .stats.ttwu_local 14619
>
> The preemption success ratio is 10514 / 51973 = 20.23%
> -----------------------------------------------------------------------------
>
> With eevdf enabled, the /proc/schedstat delta within 5 seconds on CPU8 is:
> Thu Aug 10 10:22:55 2023 cpu8
> .stats.check_preempt_count 71673 <----
> .stats.low_gran_preempt_count 57410
> .stats.need_preempt_count 57413 <----
> .stats.rq_cpu_time 5007778990
> .stats.rq_sched_info.pcount 129233
> .stats.rq_sched_info.run_delay 164830921362
> .stats.sched_count 129233
> .stats.ttwu_count 70222
> .stats.ttwu_local 66847
>
> The preemption success ratio is 57413 / 71673 = 80.10%

note: wakeup-preemption

> According to low_gran_preempt_count, most successful preemptions happen when
> current->vruntime is no greater than wakee->vruntime + sysctl_sched_wakeup_granularity,
> which would not happen in the current cfs wakeup_preempt_entity().
>
> It seems that eevdf does not inhibit wakeup preemption as much as cfs does;
> maybe that is because eevdf needs to consider fairness more?

Not fairness, latency. Because it wants to honour the virtual deadline.

Are these wakeup preemptions typically on runqueues that have only a
single other task?

That is, consider a single task running, then avg_vruntime will be its
vruntime, because the average of one variable must be the value of that
one variable.

Then the moment a second task joins, we get two options:

- positive lag
- negative lag

When the new task has negative lag, it gets placed to the right of the
currently running task (and avg_vruntime has a forward discontinuity).
At this point the new task is not eligible and does not get to run.

When the new task has positive lag, it gets placed to the left of the
currently running task (and avg_vruntime has a backward discontinuity).
At this point the currently running task is no longer eligible, and the
new task must be selected -- irrespective of its deadline.
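
The two cases can be checked with a toy model. The sketch below assumes two equal-weight tasks, treats a task as eligible when its vruntime is not ahead of the average, and places the new task at the old average minus its lag; the function names and numbers are hypothetical, and the real place_entity() logic is more involved.

```python
def avg_vruntime(vruntimes):
    """Average vruntime of equal-weight tasks (the zero-lag point)."""
    return sum(vruntimes) / len(vruntimes)


def is_eligible(vruntime, avg):
    """A task is eligible when it is not ahead of the average (lag >= 0)."""
    return vruntime <= avg


def place_second_task(curr_vruntime, lag):
    """Add a task `lag` units left (+) or right (-) of a lone running task."""
    new_vruntime = curr_vruntime - lag
    avg = avg_vruntime([curr_vruntime, new_vruntime])
    return {
        "avg": avg,
        "curr_eligible": is_eligible(curr_vruntime, avg),
        "new_eligible": is_eligible(new_vruntime, avg),
    }


positive = place_second_task(curr_vruntime=100, lag=20)   # new task lands at 80
negative = place_second_task(curr_vruntime=100, lag=-20)  # new task lands at 120
print(positive)
print(negative)
```

With positive lag the average moves back to 90 and the running task (at 100) loses eligibility; with negative lag the average moves forward to 110 and the new task (at 120) is the one that cannot run.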

The paper doesn't (AFAIR) consider the case of wake-up-preemption
explicitly. It only considers task selection and vruntime placement.

One option I suppose would be to gate the wakeup preemption by virtual
deadline, only allow when the new task has an earlier deadline than the
currently running one, and otherwise rely on tick preemption.
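
A minimal sketch of that gating idea in Python, assuming 64-bit virtual deadlines compared through a signed difference so that counter wraparound is handled. The helper names are illustrative; only the comparison direction is taken from the proposal above.

```python
MASK64 = (1 << 64) - 1


def deadline_gt(a, b):
    """True if deadline a is later than b, using a signed 64-bit difference."""
    diff = (a - b) & MASK64
    if diff >= 1 << 63:       # reinterpret the unsigned difference as signed
        diff -= 1 << 64
    return diff > 0


def allow_wakeup_preempt(wakee_deadline, curr_deadline):
    """Gate: skip wakeup preemption when the wakee's deadline is later than curr's."""
    return not deadline_gt(wakee_deadline, curr_deadline)


earlier = allow_wakeup_preempt(wakee_deadline=50, curr_deadline=100)
later = allow_wakeup_preempt(wakee_deadline=150, curr_deadline=100)
# The wakee's deadline has wrapped past 2**64, so it is still the later one.
wrapped = allow_wakeup_preempt(wakee_deadline=5, curr_deadline=(1 << 64) - 10)
print(earlier, later, wrapped)
```

When the gate refuses, the current task keeps running until tick preemption takes over.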

NOTE: poking at wakeup preemption is a double edged sword, some
workloads love it, some hate it. Touching it is bound to upset the
balance -- again.

(also, did I get that the right way around? -- I've got a Monday brain
that isn't willing to boot properly)

---
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fe5be91c71c7..16d24e5dda8f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8047,6 +8047,15 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
 	cfs_rq = cfs_rq_of(se);
 	update_curr(cfs_rq);

+	if (sched_feat(WAKEUP_DEADLINE)) {
+		/*
+		 * Only allow preemption if the virtual deadline of the new
+		 * task is before the virtual deadline of the existing task.
+		 */
+		if (deadline_gt(deadline, pse, se))
+			return;
+	}
+
 	/*
 	 * XXX pick_eevdf(cfs_rq) != se ?
 	 */
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 61bcbf5e46a4..e733981b32aa 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -24,6 +24,7 @@ SCHED_FEAT(CACHE_HOT_BUDDY, true)
  * Allow wakeup-time preemption of the current task:
  */
 SCHED_FEAT(WAKEUP_PREEMPTION, true)
+SCHED_FEAT(WAKEUP_DEADLINE, true)

 SCHED_FEAT(HRTICK, false)
 SCHED_FEAT(HRTICK_DL, false)