LinuxLists.cc - [RFC][PATCH 00/10] sched/fair: Complete EEVDF

2024-04-05 13:45:25

Subject: [RFC][PATCH 00/10] sched/fair: Complete EEVDF

Hi all,

I'm slowly crawling back out of the hole and trying to get back to work.
Availability is still low on my end, but I'll try and respond to some email.

Anyway, in order to get my feet wet again with sitting behind a computer, find
here a few patches that should functionally complete the EEVDF journey.

This very much includes the new interface that exposes the extra parameter that
EEVDF has. I've chosen to use sched_attr::sched_runtime for this over a
nice-like value because some workloads actually know their slice length (can be
dynamically measured in the same way as for deadline using
CLOCK_THREAD_CPUTIME_ID) and using the real request size is much more effective
than some relative measure.

[[ using too short a request size will increase job preemption overhead,
using too long a request size will decrease timeliness ]]
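
As a rough illustration of the interface described above, the Python sketch below measures the CPU time one job consumes via CLOCK_THREAD_CPUTIME_ID and packs that value into a sched_attr as the requested slice. It is not part of the series. The helper names, the use of ctypes, and the x86_64 syscall number 314 are assumptions made for illustration, and the slice request only has an effect on kernels that interpret sched_runtime for the fair policy.

```python
import ctypes
import os
import time

SCHED_OTHER = 0            # normal (fair) scheduling policy
SYS_SCHED_SETATTR = 314    # assumption: syscall number on x86_64 only


class SchedAttr(ctypes.Structure):
    """Mirror of the original 48-byte layout of the kernel's struct sched_attr."""
    _fields_ = [
        ("size", ctypes.c_uint32),
        ("sched_policy", ctypes.c_uint32),
        ("sched_flags", ctypes.c_uint64),
        ("sched_nice", ctypes.c_int32),
        ("sched_priority", ctypes.c_uint32),
        ("sched_runtime", ctypes.c_uint64),
        ("sched_deadline", ctypes.c_uint64),
        ("sched_period", ctypes.c_uint64),
    ]


def measure_job_ns(job):
    """Return the CPU time (ns) the calling thread spends inside job()."""
    start = time.clock_gettime_ns(time.CLOCK_THREAD_CPUTIME_ID)
    job()
    return time.clock_gettime_ns(time.CLOCK_THREAD_CPUTIME_ID) - start


def make_slice_attr(runtime_ns, nice=0):
    """Build a sched_attr that requests runtime_ns as the slice length."""
    attr = SchedAttr()
    attr.size = ctypes.sizeof(SchedAttr)
    attr.sched_policy = SCHED_OTHER
    attr.sched_nice = nice
    attr.sched_runtime = runtime_ns
    return attr


def request_slice(tid, runtime_ns):
    """Issue sched_setattr() for thread tid (0 means the calling thread)."""
    # Pass the current nice value back in so the call does not reset it.
    nice = os.getpriority(os.PRIO_PROCESS, tid)
    attr = make_slice_attr(runtime_ns, nice)
    libc = ctypes.CDLL(None, use_errno=True)
    ret = libc.syscall(ctypes.c_long(SYS_SCHED_SETATTR), ctypes.c_int(tid),
                       ctypes.byref(attr), ctypes.c_uint(0))
    if ret != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
```

A caller could run `request_slice(0, measure_job_ns(one_job))` once a representative job has been measured, where `one_job` is a placeholder for the workload's own unit of work.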

The whole delayed-dequeue thing is I think a fundamental thing that was missing
from the EEVDF paper. Without something like this EEVDF will simply not work
right. IIRC this was mentioned to me many years ago when people worked on BFQ
iosched and ran into this same issue. Time had erased the critical aspect of
this note and I had to re-discover it again.

Also, I think Ben expressed concern that preserving lag over long periods
doesn't make sense a while back.

The implementation presented here is one that should work with our cgroup mess
and keeps most of the ugly inside fair.c unlike previous versions which puked
all over the core scheduler code.

Critically cfs-cgroup throttling is not tested, and cgroups are only tested in
so far that a systemd infected machine now boots (took a bit).

Other than that, it works well enough to build the next kernel and it passes
the few trivial latency-slice tests I ran.

Anyway, please have a poke and let me know...

2024-05-27 10:21:17

by K Prateek Nayak


Subject: Re: [RFC][PATCH 00/10] sched/fair: Complete EEVDF

Hello Peter,

On 4/5/2024 3:57 PM, Peter Zijlstra wrote:
> Hi all,
>
> I'm slowly crawling back out of the hole and trying to get back to work.
> Availability is still low on my end, but I'll try and respond to some email.
>
> Anyway, in order to get my feet wet again with sitting behind a computer, find
> here a few patches that should functionally complete the EEVDF journey.
>
> This very much includes the new interface that exposes the extra parameter that
> EEVDF has. I've chosen to use sched_attr::sched_runtime for this over a
> nice-like value because some workloads actually know their slice length (can be
> dynamically measured in the same way as for deadline using
> CLOCK_THREAD_CPUTIME_ID) and using the real request size is much more effective
> than some relative measure.
>
> [[ using too short a request size will increase job preemption overhead,
> using too long a request size will decrease timeliness ]]
>
> The whole delayed-dequeue thing is I think a fundamental thing that was missing
> from the EEVDF paper. Without something like this EEVDF will simply not work
> right. IIRC this was mentioned to me many years ago when people worked on BFQ
> iosched and ran into this same issue. Time had erased the critical aspect of
> this note and I had to re-discover it again.
>
> Also, I think Ben expressed concern that preserving lag over long periods
> doesn't make sense a while back.
>
> The implementation presented here is one that should work with our cgroup mess
> and keeps most of the ugly inside fair.c unlike previous versions which puked
> all over the core scheduler code.
>
> Critically cfs-cgroup throttling is not tested, and cgroups are only tested in
> so far that a systemd infected machine now boots (took a bit).
>
> Other than that, it works well enough to build the next kernel and it passes
> the few trivial latency-slice tests I ran.
>
> Anyway, please have a poke and let me know...
>

Sorry for the delay, I was waiting for all the basic issues to get fixed
before giving it a test but since OSPM is this week and folks would want
to discuss the series, I'm sharing the results based on the current

git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git sched/eevdf

At the start of the testing the HEAD was at commit 3883d848e41a
("sched/eevdf: Use sched_attr::sched_runtime to set request/slice
suggestion") and I'm using commit 4c43611fe406 ("Merge branch
'tip/sched/urgent'") as the base for comparison.

tl;dr

Mixed bag of results. hackbench and netperf are slightly unhappy.
stream is as is, no surprises there, and tbench and schbench seem to
swing from being unhappy to happy as we scale. I'll leave the full
results below, including some experiments around
sched_feat(RESPECT_SLICE) proposed here:
https://lore.kernel.org/lkml/[email protected]/

o System Details

- 3rd Generation EPYC System
- 2 x 64C/128T
- NPS1 mode

o Kernels

incomplete: peterz/queue:sched/eevdf at commit 4c43611fe406
("Merge branch 'tip/sched/urgent'")

complete: peterz/queue:sched/eevdf at commit 3883d848e41a
("sched/eevdf: Use sched_attr::sched_runtime to
set request/slice suggestion")

Note: The above-mentioned tree is prone to force updates and the
commit IDs reflect the state from when the branch was last updated
on "2024-04-30"

o Results

==================================================================
Test : hackbench
Units : Normalized time in seconds
Interpretation: Lower is better
Statistic : AMean
==================================================================
Case:         incomplete[pct imp](CV)     complete[pct imp](CV)
 1-groups     1.00 [ -0.00]( 3.42)        1.32 [-32.19]( 2.87)
 2-groups     1.00 [ -0.00]( 1.28)        1.39 [-39.10]( 1.70)
 4-groups     1.00 [ -0.00]( 0.93)        1.50 [-49.85]( 1.28)
 8-groups     1.00 [ -0.00]( 1.00)        1.67 [-66.81]( 1.39)
16-groups     1.00 [ -0.00]( 2.69)        1.68 [-68.05]( 1.79)

==================================================================
Test : tbench
Units : Normalized throughput
Interpretation: Higher is better
Statistic : AMean
==================================================================
Clients:      incomplete[pct imp](CV)     complete[pct imp](CV)
    1         1.00 [  0.00]( 0.41)        0.98 [ -2.50]( 0.38)
    2         1.00 [  0.00]( 0.48)        0.96 [ -4.17]( 3.05)
    4         1.00 [  0.00]( 0.43)        0.96 [ -4.18]( 1.45)
    8         1.00 [  0.00]( 0.97)        0.93 [ -6.77]( 0.97)
   16         1.00 [  0.00]( 0.21)        0.91 [ -8.88]( 1.74)
   32         1.00 [  0.00]( 1.11)        0.92 [ -8.31]( 2.67)
   64         1.00 [  0.00]( 1.81)        0.96 [ -3.80]( 2.41)
  128         1.00 [  0.00]( 0.44)        1.02 [  2.43]( 0.92)
  256         1.00 [  0.00]( 0.26)        0.96 [ -4.35]( 1.07)
  512         1.00 [  0.00]( 0.12)        0.97 [ -2.87]( 0.64)
 1024         1.00 [  0.00]( 0.32)        0.97 [ -2.59]( 0.26)

==================================================================
Test : stream-10
Units : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic : HMean
==================================================================
Test:         incomplete[pct imp](CV)     complete[pct imp](CV)
 Copy         1.00 [  0.00](11.28)        0.80 [-20.19](15.70)
Scale         1.00 [  0.00]( 5.61)        0.99 [ -1.27]( 6.81)
  Add         1.00 [  0.00]( 6.76)        0.91 [ -8.97]( 7.97)
Triad         1.00 [  0.00]( 4.76)        0.88 [-12.27]( 8.45)

==================================================================
Test : stream-100
Units : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic : HMean
==================================================================
Test:         incomplete[pct imp](CV)     complete[pct imp](CV)
 Copy         1.00 [  0.00]( 3.28)        0.95 [ -4.61]( 3.81)
Scale         1.00 [  0.00]( 2.00)        0.99 [ -0.87]( 5.04)
  Add         1.00 [  0.00]( 1.44)        0.99 [ -0.51]( 2.12)
Triad         1.00 [  0.00]( 2.41)        1.01 [  0.63]( 1.45)

==================================================================
Test : netperf
Units : Normalized Throughput
Interpretation: Higher is better
Statistic : AMean
==================================================================
Clients:      incomplete[pct imp](CV)     complete[pct imp](CV)
  1-clients   1.00 [  0.00]( 0.33)        0.99 [ -0.85]( 1.65)
  2-clients   1.00 [  0.00]( 0.58)        0.99 [ -1.18]( 0.35)
  4-clients   1.00 [  0.00]( 0.32)        0.99 [ -0.78]( 1.27)
  8-clients   1.00 [  0.00]( 0.31)        0.99 [ -1.31]( 0.97)
 16-clients   1.00 [  0.00]( 0.53)        0.98 [ -1.95]( 1.11)
 32-clients   1.00 [  0.00]( 0.50)        0.97 [ -2.94]( 1.29)
 64-clients   1.00 [  0.00]( 1.71)        0.95 [ -4.86]( 1.14)
128-clients   1.00 [  0.00]( 0.82)        0.99 [ -0.67]( 2.55)
256-clients   1.00 [  0.00]( 3.88)        0.92 [ -7.65]( 3.27)
512-clients   1.00 [  0.00](43.49)        0.94 [ -6.35](49.81)

==================================================================
Test : schbench (Old)
Units : Normalized 99th percentile latency in us
Interpretation: Lower is better
Statistic : Median
==================================================================
#workers:     incomplete[pct imp](CV)     complete[pct imp](CV)
    1         1.00 [ -0.00](34.42)        0.76 [ 23.68](31.33)
    2         1.00 [ -0.00]( 6.45)        0.83 [ 16.67]( 5.84)
    4         1.00 [ -0.00](14.87)        0.77 [ 22.73](16.44)
    8         1.00 [ -0.00](11.30)        1.07 [ -6.52]( 9.55)
   16         1.00 [ -0.00]( 6.33)        1.03 [ -3.45]( 6.11)
   32         1.00 [ -0.00]( 2.18)        0.94 [  6.25]( 3.77)
   64         1.00 [ -0.00]( 7.23)        0.99 [  0.51]( 5.93)
  128         1.00 [ -0.00]( 6.76)        0.95 [  5.00]( 5.17)
  256         1.00 [ -0.00](11.05)        1.22 [-22.33]( 9.70)
  512         1.00 [ -0.00]( 2.15)        0.77 [ 23.20]( 2.81)
--
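
For readers reproducing the tables above, here is a minimal sketch of how the bracketed "pct imp" and the parenthesized CV columns could be derived from raw samples. The exact formulas used for the report are not stated in this mail, so the sign convention (positive means better) and the use of the sample standard deviation are assumptions, and both function names are made up for this example.

```python
import statistics


def cv_percent(samples):
    """Coefficient of variation: sample standard deviation as a percentage of the mean."""
    return 100.0 * statistics.stdev(samples) / statistics.mean(samples)


def pct_improvement(baseline, candidate, lower_is_better):
    """Percentage change versus baseline, signed so that positive means better."""
    change = 100.0 * (candidate - baseline) / baseline
    # For time and latency metrics a smaller value is the improvement.
    return -change if lower_is_better else change
```

With the rounded normalized values, `pct_improvement(1.00, 1.32, lower_is_better=True)` gives -32.0 for the hackbench 1-groups row. The table lists -32.19, presumably because it was computed from unrounded means.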

o Regression Breakdown

Following are some data I could gather for the regression
observed:

- Hackbench

The runtime of hackbench with EEVDF Complete is 24.81% more
than that on the current base kernel. Following is the IBS
profile comparing EEVDF Complete on the left to current base
on the right:

Command: sudo perf record -a -e ibs_op// -- perf bench sched messaging -p -t -l 100000 -g 1

Overhead Command Shared Object Symbol Overhead Command Shared Object Symbol
4.42% sched-messaging [kernel.kallsyms] [k] osq_lock 6.30% sched-messaging [kernel.kallsyms] [k] osq_lock
3.76% sched-messaging [kernel.kallsyms] [k] _copy_from_iter 4.06% sched-messaging [kernel.kallsyms] [k] _copy_to_iter
3.67% sched-messaging [kernel.kallsyms] [k] srso_alias_safe_ret 3.98% sched-messaging [kernel.kallsyms] [k] _copy_from_iter
3.18% sched-messaging [kernel.kallsyms] [k] _copy_to_iter 3.85% sched-messaging [kernel.kallsyms] [k] mutex_spin_on_owner
2.80% sched-messaging [kernel.kallsyms] [k] pipe_read 3.50% sched-messaging [kernel.kallsyms] [k] srso_alias_safe_ret
2.72% sched-messaging [kernel.kallsyms] [k] mutex_spin_on_owner 3.33% sched-messaging [kernel.kallsyms] [k] inode_needs_update_time
2.69% sched-messaging [kernel.kallsyms] [k] pipe_write 2.39% sched-messaging [kernel.kallsyms] [k] pipe_read
2.49% sched-messaging [kernel.kallsyms] [k] syscall_exit_to_user_mode 2.36% sched-messaging [kernel.kallsyms] [k] syscall_exit_to_user_mode
2.40% swapper [kernel.kallsyms] [k] poll_idle <---- 2.30% sched-messaging [kernel.kallsyms] [k] pipe_write
2.31% sched-messaging [kernel.kallsyms] [k] inode_needs_update_time 2.23% sched-messaging [kernel.kallsyms] [k] mutex_lock
2.14% sched-messaging [kernel.kallsyms] [k] srso_alias_return_thunk 2.07% sched-messaging [kernel.kallsyms] [k] update_sd_lb_stats.constprop.0
2.13% sched-messaging [kernel.kallsyms] [k] aa_file_perm 2.02% sched-messaging [kernel.kallsyms] [k] ksys_write
1.99% sched-messaging [kernel.kallsyms] [k] native_queued_spin_lock_slowpath 2.00% sched-messaging [kernel.kallsyms] [k] aa_file_perm
1.99% sched-messaging [kernel.kallsyms] [k] do_syscall_64 1.98% sched-messaging libc.so.6 [.] __GI___libc_write
1.92% sched-messaging [kernel.kallsyms] [k] ksys_write 1.94% sched-messaging [kernel.kallsyms] [k] do_syscall_64
1.90% sched-messaging [kernel.kallsyms] [k] __fdget_pos 1.92% sched-messaging [kernel.kallsyms] [k] srso_alias_return_thunk
1.89% sched-messaging [kernel.kallsyms] [k] mutex_lock 1.76% sched-messaging [kernel.kallsyms] [k] entry_SYSCALL_64
1.88% sched-messaging [kernel.kallsyms] [k] entry_SYSCALL_64 1.62% sched-messaging [kernel.kallsyms] [k] __mutex_lock.constprop.0
1.62% sched-messaging libc.so.6 [.] __GI___libc_write 1.48% sched-messaging [kernel.kallsyms] [k] __fdget_pos
1.61% sched-messaging [kernel.kallsyms] [k] vfs_read 1.47% sched-messaging [kernel.kallsyms] [k] vfs_read
1.54% sched-messaging [kernel.kallsyms] [k] apparmor_file_permission 1.31% sched-messaging [kernel.kallsyms] [k] native_queued_spin_lock_slowpath
1.43% sched-messaging [kernel.kallsyms] [k] rep_movs_alternative 1.29% sched-messaging [kernel.kallsyms] [k] apparmor_file_permission
1.40% sched-messaging [kernel.kallsyms] [k] entry_SYSCALL_64_after_hwframe 1.25% sched-messaging libc.so.6 [.] read
1.36% sched-messaging [kernel.kallsyms] [k] ksys_read 1.22% sched-messaging [kernel.kallsyms] [k] mutex_unlock
1.35% sched-messaging [kernel.kallsyms] [k] vfs_write 1.18% sched-messaging [kernel.kallsyms] [k] rep_movs_alternative
1.24% sched-messaging libc.so.6 [.] read 1.17% sched-messaging [kernel.kallsyms] [k] entry_SYSCALL_64_after_hwframe
1.23% sched-messaging [kernel.kallsyms] [k] srso_alias_untrain_ret 1.11% sched-messaging [kernel.kallsyms] [k] atime_needs_update
1.15% sched-messaging [kernel.kallsyms] [k] __mutex_lock.constprop.0 1.10% sched-messaging [kernel.kallsyms] [k] vfs_write
1.15% sched-messaging [kernel.kallsyms] [k] atime_needs_update 1.08% sched-messaging [kernel.kallsyms] [k] ksys_read
1.11% swapper [kernel.kallsyms] [k] psi_group_change 1.03% sched-messaging [kernel.kallsyms] [k] srso_alias_untrain_ret
1.03% sched-messaging [kernel.kallsyms] [k] copy_page_from_iter 1.01% swapper [kernel.kallsyms] [k] psi_group_change
0.98% sched-messaging [kernel.kallsyms] [k] update_sd_lb_stats.constprop.0 0.96% sched-messaging [kernel.kallsyms] [k] copy_page_from_iter
0.95% sched-messaging [kernel.kallsyms] [k] psi_group_change 0.93% sched-messaging [kernel.kallsyms] [k] psi_group_change
0.91% sched-messaging [kernel.kallsyms] [k] copy_page_to_iter 0.79% sched-messaging [kernel.kallsyms] [k] copy_page_to_iter
0.89% sched-messaging [kernel.kallsyms] [k] rw_verify_area 0.76% swapper [kernel.kallsyms] [k] poll_idle <------ poll_idle jumps up in case of EEVDF complete
0.86% swapper [kernel.kallsyms] [k] acpi_processor_ffh_cstate_enter 0.71% sched-messaging [kernel.kallsyms] [k] rw_verify_area
0.77% sched-messaging [kernel.kallsyms] [k] mutex_unlock 0.71% swapper [kernel.kallsyms] [k] acpi_processor_ffh_cstate_enter
0.77% sched-messaging [kernel.kallsyms] [k] security_file_permission 0.69% sched-messaging [kernel.kallsyms] [k] try_to_wake_up
0.73% sched-messaging [kernel.kallsyms] [k] select_task_rq_fair 0.67% sched-messaging [kernel.kallsyms] [k] select_task_rq_fair
0.71% sched-messaging [kernel.kallsyms] [k] __schedule 0.63% sched-messaging [kernel.kallsyms] [k] __schedule
0.70% sched-messaging [kernel.kallsyms] [k] entry_SYSRETQ_unsafe_stack 0.61% sched-messaging [kernel.kallsyms] [k] entry_SYSRETQ_unsafe_stack
0.59% sched-messaging [kernel.kallsyms] [k] syscall_return_via_sysret 0.61% sched-messaging [kernel.kallsyms] [k] security_file_permission
0.57% sched-messaging [kernel.kallsyms] [k] dequeue_entity 0.52% sched-messaging [kernel.kallsyms] [k] update_load_avg
0.54% sched-messaging [kernel.kallsyms] [k] current_time 0.51% sched-messaging [kernel.kallsyms] [k] native_sched_clock
--

Further pinning the workload to one LLC domain (CCX) shows poll_idle
jump to top:

Command: sudo perf record -C 0-7,128-131 -e ibs_op// -- taskset -c 0-7,128-131 perf bench sched messaging -p -t -l 100000 -g 1

Overhead Command Shared Object Symbol Overhead Command Shared Object Symbol
9.14% swapper [kernel.kallsyms] [k] poll_idle <----- 3.55% sched-messaging [kernel.kallsyms] [k] pipe_read
3.02% sched-messaging [kernel.kallsyms] [k] native_queued_spin_lock_slowpath 3.53% sched-messaging [kernel.kallsyms] [k] pipe_write
3.00% sched-messaging [kernel.kallsyms] [k] pipe_read 3.48% sched-messaging [kernel.kallsyms] [k] srso_alias_safe_ret
2.97% sched-messaging [kernel.kallsyms] [k] pipe_write 3.37% sched-messaging [kernel.kallsyms] [k] _copy_from_iter
2.95% sched-messaging [kernel.kallsyms] [k] srso_alias_safe_ret 3.15% sched-messaging [kernel.kallsyms] [k] syscall_exit_to_user_mode
2.83% sched-messaging [kernel.kallsyms] [k] _copy_from_iter 2.76% sched-messaging [kernel.kallsyms] [k] native_queued_spin_lock_slowpath
2.56% sched-messaging [kernel.kallsyms] [k] syscall_exit_to_user_mode 2.62% sched-messaging [kernel.kallsyms] [k] srso_alias_return_thunk
2.23% sched-messaging [kernel.kallsyms] [k] srso_alias_return_thunk 2.59% sched-messaging [kernel.kallsyms] [k] do_syscall_64
2.21% sched-messaging [kernel.kallsyms] [k] do_syscall_64 2.44% sched-messaging [kernel.kallsyms] [k] __fdget_pos
2.05% sched-messaging [kernel.kallsyms] [k] entry_SYSCALL_64 2.34% sched-messaging [kernel.kallsyms] [k] entry_SYSCALL_64
2.04% sched-messaging [kernel.kallsyms] [k] __fdget_pos 2.20% sched-messaging [kernel.kallsyms] [k] _copy_to_iter
1.87% sched-messaging [kernel.kallsyms] [k] _copy_to_iter 2.18% sched-messaging [kernel.kallsyms] [k] aa_file_perm
1.82% sched-messaging [kernel.kallsyms] [k] aa_file_perm 2.16% sched-messaging [kernel.kallsyms] [k] vfs_read
1.67% sched-messaging [kernel.kallsyms] [k] apparmor_file_permission 2.08% swapper [kernel.kallsyms] [k] poll_idle <-----
1.65% sched-messaging [kernel.kallsyms] [k] vfs_read 2.03% sched-messaging [kernel.kallsyms] [k] apparmor_file_permission
1.63% sched-messaging [kernel.kallsyms] [k] entry_SYSCALL_64_after_hwframe 2.00% sched-messaging [kernel.kallsyms] [k] entry_SYSCALL_64_after_hwframe
1.59% swapper [kernel.kallsyms] [k] psi_group_change 1.85% sched-messaging [kernel.kallsyms] [k] rep_movs_alternative
1.56% sched-messaging [kernel.kallsyms] [k] ksys_write 1.83% sched-messaging [kernel.kallsyms] [k] vfs_write
1.54% sched-messaging [kernel.kallsyms] [k] ksys_read 1.82% sched-messaging [kernel.kallsyms] [k] ksys_write
1.51% sched-messaging [kernel.kallsyms] [k] vfs_write 1.72% sched-messaging [kernel.kallsyms] [k] srso_alias_untrain_ret
1.49% sched-messaging [kernel.kallsyms] [k] rep_movs_alternative 1.71% sched-messaging libc.so.6 [.] __GI___libc_write
1.46% sched-messaging libc.so.6 [.] __GI___libc_write 1.70% sched-messaging [kernel.kallsyms] [k] ksys_read
1.42% sched-messaging [kernel.kallsyms] [k] srso_alias_untrain_ret 1.58% sched-messaging [kernel.kallsyms] [k] atime_needs_update
1.30% sched-messaging libc.so.6 [.] read 1.56% sched-messaging libc.so.6 [.] read
1.24% sched-messaging [kernel.kallsyms] [k] atime_needs_update 1.56% sched-messaging [kernel.kallsyms] [k] mutex_lock
1.20% sched-messaging [kernel.kallsyms] [k] mutex_lock 1.35% sched-messaging [kernel.kallsyms] [k] psi_group_change
1.04% sched-messaging [kernel.kallsyms] [k] psi_group_change 1.26% sched-messaging [kernel.kallsyms] [k] copy_page_to_iter
1.03% sched-messaging [kernel.kallsyms] [k] inode_needs_update_time 1.20% sched-messaging [kernel.kallsyms] [k] select_task_rq_fair
0.99% sched-messaging [kernel.kallsyms] [k] rw_verify_area 1.16% sched-messaging [kernel.kallsyms] [k] inode_needs_update_time
0.98% sched-messaging [kernel.kallsyms] [k] copy_page_to_iter 1.16% sched-messaging [kernel.kallsyms] [k] rw_verify_area
0.95% sched-messaging [kernel.kallsyms] [k] select_task_rq_fair 1.12% sched-messaging [kernel.kallsyms] [k] copy_page_from_iter
0.95% sched-messaging [kernel.kallsyms] [k] copy_page_from_iter 1.06% swapper [kernel.kallsyms] [k] psi_group_change
0.92% sched-messaging [kernel.kallsyms] [k] __schedule 1.04% sched-messaging [kernel.kallsyms] [k] security_file_permission
0.85% sched-messaging [kernel.kallsyms] [k] security_file_permission 0.96% sched-messaging [kernel.kallsyms] [k] try_to_wake_up
0.68% swapper [kernel.kallsyms] [k] do_idle 0.86% sched-messaging [kernel.kallsyms] [k] entry_SYSRETQ_unsafe_stack
0.67% sched-messaging [kernel.kallsyms] [k] dequeue_entity 0.80% sched-messaging [kernel.kallsyms] [k] mutex_spin_on_owner
0.67% sched-messaging [kernel.kallsyms] [k] entry_SYSRETQ_unsafe_stack 0.79% sched-messaging [kernel.kallsyms] [k] __schedule
0.66% sched-messaging [kernel.kallsyms] [k] update_load_avg 0.79% sched-messaging [kernel.kallsyms] [k] syscall_return_via_sysret
0.65% sched-messaging [kernel.kallsyms] [k] syscall_return_via_sysret 0.75% sched-messaging [kernel.kallsyms] [k] current_time
0.62% sched-messaging [kernel.kallsyms] [k] mutex_spin_on_owner 0.66% sched-messaging [kernel.kallsyms] [k] __mutex_lock.constprop.0
0.60% swapper [kernel.kallsyms] [k] acpi_processor_ffh_cstate_enter 0.63% sched-messaging perf [.] sender
0.60% swapper [kernel.kallsyms] [k] __schedule 0.63% sched-messaging [kernel.kallsyms] [k] update_load_avg
0.59% sched-messaging [kernel.kallsyms] [k] current_time 0.59% sched-messaging perf [.] receiver
0.54% swapper [kernel.kallsyms] [k] enqueue_entity 0.56% sched-messaging [kernel.kallsyms] [k] file_update_time
0.52% swapper [kernel.kallsyms] [k] menu_select 0.55% sched-messaging [kernel.kallsyms] [k] touch_atime
0.50% sched-messaging perf [.] sender 0.52% sched-messaging [kernel.kallsyms] [k] x64_sys_call
--

The jump in poll_idle seems to be a higher-order behavior change
resulting from Complete EEVDF. Looking at "sched/fair: Implement delayed
dequeue", can there be a case where ttwu_runnable() in try_to_wake_up()
prevents task migration on wakeup and leaves the task queued on a busy
runqueue where it was marked for a delayed dequeue?

When looking at per-task statistics from sched-scoreboard, I see a 41%
decrease in nr_migrations going from base to Complete EEVDF. I also
see the cumulative wait sum shoot up by 88%.

comm: sched-messaging             incomplete      complete        %diff
Average of avg_atom               0.118341601     0.036585796     -69%
Average of avg_per_cpu            0.446550378     0.668355323      50%
Sum of nr_migrations              1414593         829870          -41% *
Sum of nr_switches                4013420         3563675         -11%
Sum of nr_voluntary_switches      3912865         3562547          -9%
Sum of nr_involuntary_switches    100555          1128            -99%
Sum of nr_wakeups                 3913621         3567097          -9%
Sum of sum_exec_runtime           95557.37505     101666.7651       6%
Sum of sum_idle_runtime           1928546.553     975487.7551     -49%
Sum of wait_sum                   69884.24284     131511.9207      88% *
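
One possible way to gather similar per-task counters without sched-scoreboard is to parse /proc/<pid>/sched before and after a run. The sketch below is an assumption about how that could be done, not a description of how the numbers above were collected. Field names in that file vary between kernel versions and some only appear with schedstats enabled, and all three helper names are made up for this example.

```python
import re

# Matches lines of the form "  se.nr_migrations      :      123".
_FIELD = re.compile(r"^\s*([\w.\->]+)\s*:\s*(-?\d+(?:\.\d+)?)\s*$")


def parse_sched_text(text):
    """Turn the "key : value" lines of /proc/<pid>/sched into a dict of floats."""
    stats = {}
    for line in text.splitlines():
        match = _FIELD.match(line)
        if match:
            # Drop prefixes such as "se." or "se.statistics." from the key.
            key = match.group(1).split(".")[-1]
            stats[key] = float(match.group(2))
    return stats


def read_task_sched(pid):
    """Read and parse the scheduler statistics of one task."""
    with open(f"/proc/{pid}/sched") as f:
        return parse_sched_text(f.read())


def diff_stats(base, new):
    """Percentage change of every counter present in both snapshots."""
    return {key: 100.0 * (new[key] - base[key]) / base[key]
            for key in base if key in new and base[key] != 0}
```

Applied to the sums in the table, `diff_stats` yields roughly -41% for nr_migrations and +88% for wait_sum, matching the %diff column.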

For tbench and netperf I could not identify any concrete culprit from
perf with IBS or per-task stats. I'll continue digging. However, it
could just be a similar higher-order effect where poll_idle steals
cycles from the SMT sibling running the task.

---------------------------------------------------------
- Experiments with sched_attr::sched_runtime on SPECjbb -
---------------------------------------------------------

I tried some experiments with sched_attr::sched_runtime on SPECjbb.
With the base kernel, I was not able to see any changes in the SPECjbb
results with various sched_attr::sched_runtime values set for various
threads. However, with Complete EEVDF (contains RESPECT_SLICE), I'm
able to eke out some performance improvements on the workload.

Background information on the threads of SPECjbb - txinject drives
the workload by injecting transactions. MapReduce, ForkJoinPool, and
the Backend threads are responsible for the main business logic of
transaction processing. There are GC threads from Java and a
compiler thread for what I presume is some form of JIT compilation.

I've used the following naive strategy:

- Set a very small slice to txinject since I would like to push as much
load as possible, as quickly as possible.

- Very small slice to ForkJoinPool and MapReduce threads so they can go
ahead and pass the load to backend threads.

- Very small slice to compiler threads so they can unblock whatever is
waiting on the JIT compilation.

- A very large slice to the backend thread so it can process as much
data as possible in one go.

- Maximum slice size to the GC thread purely because it should not
interrupt other processes.
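
The strategy above can be written down as a small lookup table from thread name to requested slice. This is only a sketch: the comm patterns and the 50 ms value for the backend threads are hypothetical, and the 0.1 ms / 100 ms bounds reflect my reading of how the series clamps sched_runtime, so treat them as assumptions too.

```python
import fnmatch

MIN_SLICE_NS = 100_000        # 0.1 ms, assumed lower clamp for a slice request
MAX_SLICE_NS = 100_000_000    # 100 ms, assumed upper clamp for a slice request

# (comm pattern, requested slice in ns). Patterns and values are hypothetical.
SLICE_POLICY = [
    ("txinject*", MIN_SLICE_NS),       # push load as quickly as possible
    ("ForkJoinPool*", MIN_SLICE_NS),   # hand work to the backend threads
    ("MapReduce*", MIN_SLICE_NS),
    ("*Compiler*", MIN_SLICE_NS),      # unblock whatever waits on the JIT
    ("Backend*", 50_000_000),          # process as much as possible in one go
    ("GC*", MAX_SLICE_NS),             # should not interrupt other tasks
]


def slice_for(comm, default=None):
    """Return the slice (ns) for the first pattern matching comm, else default."""
    for pattern, slice_ns in SLICE_POLICY:
        if fnmatch.fnmatchcase(comm, pattern):
            return slice_ns
    return default
```

Keeping the policy in one table makes it easy to try different slice values per thread class between runs.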

With the above setting, I see the following trend:

                  EEVDF Complete vs base
Max-jOPS                 +7.05%
Critical-jOPS           -15.62%

The key takeaway of the experiment is that, with EEVDF complete, I'm
actually starting to see some behavior difference with different values,
unlike with the current tip:sched/core. I believe it is RESPECT_SLICE
helping to keep tasks with large slices on the runqueue beyond the
0-lag point.

Note: These results are from naive experiments. I'm sure looking at the
per-task statistics and deciding on the slices more judiciously can
further benefit the workload.

On a parting note, it was not very straightforward to set these values
outside of the application without any application modification. In my
case, I was consuming "task_rename" events via trace pipe to find the
relevant pids to then tag.
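
A minimal sketch of that approach follows. It assumes the task_rename event has already been enabled in tracefs, that trace_pipe is readable at the path shown, and that each event line carries "pid=... oldcomm=... newcomm=... oom_score_adj=..." fields. The line format may differ between kernel versions, and the function names are made up for this example.

```python
import re

# Assumed payload of one task_rename line in trace_pipe.
RENAME_RE = re.compile(
    r"task_rename: pid=(\d+) oldcomm=(.*) newcomm=(.*) oom_score_adj=(-?\d+)")


def parse_task_rename(line):
    """Return (pid, newcomm) for a task_rename line, or None for anything else."""
    match = RENAME_RE.search(line)
    if match is None:
        return None
    return int(match.group(1)), match.group(3)


def watch_renames(wanted_prefixes, trace_pipe="/sys/kernel/tracing/trace_pipe"):
    """Yield (pid, newcomm) for renamed threads whose new name has a wanted prefix."""
    prefixes = tuple(wanted_prefixes)
    with open(trace_pipe) as pipe:   # blocks until events arrive, usually needs root
        for line in pipe:
            parsed = parse_task_rename(line)
            if parsed and parsed[1].startswith(prefixes):
                yield parsed
```

Each yielded pid can then be tagged with a slice through sched_setattr() from outside the application.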

------------------------------------
- DeathStarBench and RESPECT_SLICE -
------------------------------------

Following is the trend I see for RESPECT_SLICE on DeathStarBench:

Pinning    #instances    EEVDF_COMPLETE
1LLC           1               9%
2LLC           2               3%
4LLC           3              -1%
8LLC           6              -5%

Dropping RESPECT_SLICE and using my original RUN_TO_PARITY_WAKEUP
https://lore.kernel.org/lkml/[email protected]/
still shows a regression for larger instances pinned to a larger number
of LLC domains with EEVDF Complete. This remains to be investigated as
well.

I'll update the thread with more information in the coming days.
--
Thanks and Regards,
Prateek