LinuxLists.cc - EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

2023-11-16 19:03:40

Subject: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

Hi,

when testing the EEVDF scheduler we stumbled upon a performance
regression in a uperf scenario and would like to kindly ask for
feedback on whether we are going in the right direction with our
analysis so far.

The base scenario consists of two KVM guests running on an s390 LPAR. One
guest hosts the uperf server, one the uperf client.
With EEVDF we observe a regression of ~50% for a strburst test.
For a more detailed description of the setup see the section TEST
SUMMARY at the bottom.

Bisecting led us to the following commit which appears to introduce the
regression:
86bfbb7ce4f6 sched/fair: Add lag based placement

We then compared the last good commit we identified with a recent level
of the devel branch.
The issue still persists on 6.7 rc1, although there is some improvement
(down from a 62% regression to 49%).

All analysis described below is based on a 6.6 rc7 kernel.

We sampled perf data to get an idea of what is going wrong and ended up
seeing a dramatic increase in the maximum wait times, from 3 ms up to
366 ms. See section WAIT DELAYS below for more details.

We then collected tracing data to get a better insight into what is
going on.
The trace excerpt in section TRACE EXCERPT shows one example (of
multiple per test run) of the problematic scenario, where a kworker
(pid=6525) has to wait for 39.718 ms.

Short summary:
The mentioned kworker has been scheduled to CPU 14 before the tracing
was enabled.
A vhost process is migrated onto CPU 14.
The vruntimes of kworker and vhost differ significantly (86642125805 vs
4242563284 -> factor 20)
The vhost process wants to wake up the kworker, therefore the kworker is
placed onto the runqueue again and set to runnable.
The vhost process continues to execute, waking up other vhost processes
on other CPUs.

So far this behavior is not different to what we see on pre-EEVDF
kernels.

On timestamp 576.162767, the vhost process triggers the last wake up of
another vhost on another CPU.
Until timestamp 576.171155, we see no other activity. Now, the vhost
process ends its time slice.
Then, vhost is re-assigned a new time slice 4 times and is then
migrated off to CPU 15.
This does not occur with older kernels.
The kworker has to wait for the migration to happen in order to be able
to execute again.
This is because the vruntime of the kworker is significantly larger than
that of the vhost.

We observed a vruntime difference of the same magnitude between kworker
and vhost on a kernel built from the parent of the commit mentioned above.
With EEVDF, the kworker is doomed to wait until the vhost either catches
up on vruntime (which would take 86 seconds)
or the vhost is migrated off of the CPU.
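
The size of that gap can be checked with a little arithmetic. The
following is only a sketch: it assumes both vruntimes belong to the same
cfs_rq timeline and that the vhost runs at nice 0, so that its vruntime
advances at wall-clock rate. The helper names are made up for this example
and do not exist in the kernel.

```c
#include <assert.h>
#include <stdint.h>

/* vruntime values copied from the sched_place trace lines, in ns */
#define KWORKER_VRUNTIME_NS 86642125805ULL
#define VHOST_VRUNTIME_NS    4242563284ULL
#define NSEC_PER_SEC         1000000000ULL

/* Hypothetical helper: vruntime the task that is behind has to accumulate. */
uint64_t vruntime_gap_ns(uint64_t ahead, uint64_t behind)
{
	assert(ahead >= behind);
	return ahead - behind;
}

/* Hypothetical helper: whole seconds needed to close the gap at nice 0. */
uint64_t catch_up_seconds(uint64_t gap_ns)
{
	return gap_ns / NSEC_PER_SEC;
}

/* Hypothetical helper: integer ratio between the two vruntimes. */
uint64_t vruntime_ratio(uint64_t ahead, uint64_t behind)
{
	assert(behind != 0);
	return ahead / behind;
}
```

With the traced values the gap comes out at 82399562521 ns, roughly 82 s of
uninterrupted vhost runtime, which is the same order of magnitude as the
estimate above, and the ratio rounds down to 20.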

We found some options which sound plausible but we are not sure if they
are valid or not:

1. The wake up path has a dependency on the vruntime metrics that now
delays the execution of the kworker.
2. The previous commit af4cf40470c2 (sched/fair: Add
cfs_rq::avg_vruntime), which updates the way cfs_rq->min_vruntime and
cfs_rq->avg_vruntime are set, might have introduced an issue which is
uncovered by the commit mentioned above.
3. An assumption in the vhost code which causes vhost to rely on being
scheduled off in time to allow the kworker to proceed.

We also stumbled upon the following mailing thread:
https://lore.kernel.org/lkml/ZORaUsd+So+tnyMV@chenyu5-mobl2/
That conversation, and the patches derived from it, lead us to assume
that the wake up path could be adjusted in a way that addresses this
case in particular.
At the same time, the vast difference in vruntimes is concerning since,
at least for some time frame, both processes are on the runqueue.

We would be glad to hear some feedback on which paths to pursue and
which might just be a dead end in the first place.

#################### TRACE EXCERPT ####################
The sched_place trace event was added to the end of the place_entity
function and outputs:
sev -> sched_entity vruntime
sed -> sched_entity deadline
sel -> sched_entity vlag
avg -> cfs_rq avg_vruntime
min -> cfs_rq min_vruntime
cpu -> cpu of cfs_rq
nr -> cfs_rq nr_running
---
CPU 3/KVM-2950 [014] d.... 576.161432: sched_migrate_task:
comm=vhost-2920 pid=2941 prio=120 orig_cpu=15 dest_cpu=14
--> migrates task from cpu 15 to 14
CPU 3/KVM-2950 [014] d.... 576.161433: sched_place:
comm=vhost-2920 pid=2941 sev=4242563284 sed=4245563284 sel=0
avg=4242563284 min=4242563284 cpu=14 nr=0
--> places vhost 2920 on CPU 14 with vruntime 4242563284
CPU 3/KVM-2950 [014] d.... 576.161433: sched_place: comm= pid=0
sev=16329848593 sed=16334604010 sel=0 avg=16329848593 min=16329848593
cpu=14 nr=0
CPU 3/KVM-2950 [014] d.... 576.161433: sched_place: comm= pid=0
sev=42560661157 sed=42627443765 sel=0 avg=42560661157 min=42560661157
cpu=14 nr=0
CPU 3/KVM-2950 [014] d.... 576.161434: sched_place: comm= pid=0
sev=53846627372 sed=54125900099 sel=0 avg=53846627372 min=53846627372
cpu=14 nr=0
CPU 3/KVM-2950 [014] d.... 576.161434: sched_place: comm= pid=0
sev=86640641980 sed=87255041979 sel=0 avg=86640641980 min=86640641980
cpu=14 nr=0
CPU 3/KVM-2950 [014] dN... 576.161434: sched_stat_wait:
comm=vhost-2920 pid=2941 delay=9958 [ns]
CPU 3/KVM-2950 [014] d.... 576.161435: sched_switch:
prev_comm=CPU 3/KVM prev_pid=2950 prev_prio=120 prev_state=S ==>
next_comm=vhost-2920 next_pid=2941 next_prio=120
vhost-2920-2941 [014] D.... 576.161439: sched_waking:
comm=vhost-2286 pid=2309 prio=120 target_cpu=008
vhost-2920-2941 [014] d.... 576.161446: sched_waking:
comm=kworker/14:0 pid=6525 prio=120 target_cpu=014
vhost-2920-2941 [014] d.... 576.161447: sched_place:
comm=kworker/14:0 pid=6525 sev=86642125805 sed=86645125805 sel=0
avg=86642125805 min=86642125805 cpu=14 nr=1
--> places kworker 6525 on cpu 14 with vruntime 86642125805
--> which is far larger than vhost vruntime of 4242563284
vhost-2920-2941 [014] d.... 576.161447: sched_stat_blocked:
comm=kworker/14:0 pid=6525 delay=10143757 [ns]
vhost-2920-2941 [014] dN... 576.161447: sched_wakeup:
comm=kworker/14:0 pid=6525 prio=120 target_cpu=014
vhost-2920-2941 [014] dN... 576.161448: sched_stat_runtime:
comm=vhost-2920 pid=2941 runtime=13884 [ns] vruntime=4242577168 [ns]
--> vhost 2920 finishes after 13884 ns of runtime
vhost-2920-2941 [014] dN... 576.161448: sched_stat_wait:
comm=kworker/14:0 pid=6525 delay=0 [ns]
vhost-2920-2941 [014] d.... 576.161448: sched_switch:
prev_comm=vhost-2920 prev_pid=2941 prev_prio=120 prev_state=R+ ==>
next_comm=kworker/14:0 next_pid=6525 next_prio=120
--> switch to kworker
kworker/14:0-6525 [014] d.... 576.161449: sched_waking: comm=CPU
2/KVM pid=2949 prio=120 target_cpu=007
kworker/14:0-6525 [014] d.... 576.161450: sched_stat_runtime:
comm=kworker/14:0 pid=6525 runtime=3714 [ns] vruntime=86642129519 [ns]
--> kworker finishes after 3714 ns of runtime
kworker/14:0-6525 [014] d.... 576.161450: sched_stat_wait:
comm=vhost-2920 pid=2941 delay=3714 [ns]
kworker/14:0-6525 [014] d.... 576.161451: sched_switch:
prev_comm=kworker/14:0 prev_pid=6525 prev_prio=120 prev_state=I ==>
next_comm=vhost-2920 next_pid=2941 next_prio=120
--> switch back to vhost
vhost-2920-2941 [014] d.... 576.161478: sched_waking:
comm=kworker/14:0 pid=6525 prio=120 target_cpu=014
vhost-2920-2941 [014] d.... 576.161478: sched_place:
comm=kworker/14:0 pid=6525 sev=86642191859 sed=86645191859 sel=-1150
avg=86642188144 min=86642188144 cpu=14 nr=1
--> kworker placed again on cpu 14 with vruntime 86642191859, the
problem occurs only if lag <= 0, having lag=0 does not always hit the
problem though
vhost-2920-2941 [014] d.... 576.161478: sched_stat_blocked:
comm=kworker/14:0 pid=6525 delay=27943 [ns]
vhost-2920-2941 [014] d.... 576.161479: sched_wakeup:
comm=kworker/14:0 pid=6525 prio=120 target_cpu=014
vhost-2920-2941 [014] D.... 576.161511: sched_waking:
comm=vhost-2286 pid=2308 prio=120 target_cpu=006
vhost-2920-2941 [014] D.... 576.161512: sched_waking:
comm=vhost-2286 pid=2309 prio=120 target_cpu=008
vhost-2920-2941 [014] D.... 576.161516: sched_waking:
comm=vhost-2286 pid=2308 prio=120 target_cpu=006
vhost-2920-2941 [014] D.... 576.161773: sched_waking:
comm=vhost-2286 pid=2308 prio=120 target_cpu=006
vhost-2920-2941 [014] D.... 576.161775: sched_waking:
comm=vhost-2286 pid=2309 prio=120 target_cpu=008
vhost-2920-2941 [014] D.... 576.162103: sched_waking:
comm=vhost-2286 pid=2308 prio=120 target_cpu=006
vhost-2920-2941 [014] D.... 576.162105: sched_waking:
comm=vhost-2286 pid=2307 prio=120 target_cpu=021
vhost-2920-2941 [014] D.... 576.162326: sched_waking:
comm=vhost-2286 pid=2305 prio=120 target_cpu=004
vhost-2920-2941 [014] D.... 576.162437: sched_waking:
comm=vhost-2286 pid=2308 prio=120 target_cpu=006
vhost-2920-2941 [014] D.... 576.162767: sched_waking:
comm=vhost-2286 pid=2305 prio=120 target_cpu=004
vhost-2920-2941 [014] d.h.. 576.171155: sched_stat_runtime:
comm=vhost-2920 pid=2941 runtime=9704465 [ns] vruntime=4252281633 [ns]
vhost-2920-2941 [014] d.h.. 576.181155: sched_stat_runtime:
comm=vhost-2920 pid=2941 runtime=10000377 [ns] vruntime=4262282010 [ns]
vhost-2920-2941 [014] d.h.. 576.191154: sched_stat_runtime:
comm=vhost-2920 pid=2941 runtime=9999514 [ns] vruntime=4272281524 [ns]
vhost-2920-2941 [014] d.h.. 576.201155: sched_stat_runtime:
comm=vhost-2920 pid=2941 runtime=10000246 [ns] vruntime=4282281770 [ns]
--> vhost gets rescheduled multiple times because its vruntime is
significantly smaller than the vruntime of the kworker
vhost-2920-2941 [014] dNh.. 576.201176: sched_wakeup:
comm=migration/14 pid=85 prio=0 target_cpu=014
vhost-2920-2941 [014] dN... 576.201191: sched_stat_runtime:
comm=vhost-2920 pid=2941 runtime=25190 [ns] vruntime=4282306960 [ns]
vhost-2920-2941 [014] d.... 576.201192: sched_switch:
prev_comm=vhost-2920 prev_pid=2941 prev_prio=120 prev_state=R+ ==>
next_comm=migration/14 next_pid=85 next_prio=0
migration/14-85 [014] d..1. 576.201194: sched_migrate_task:
comm=vhost-2920 pid=2941 prio=120 orig_cpu=14 dest_cpu=15
--> vhost gets migrated off of cpu 14
migration/14-85 [014] d..1. 576.201194: sched_place:
comm=vhost-2920 pid=2941 sev=3198666923 sed=3201666923 sel=0
avg=3198666923 min=3198666923 cpu=15 nr=0
migration/14-85 [014] d..1. 576.201195: sched_place: comm= pid=0
sev=12775683594 sed=12779398224 sel=0 avg=12775683594 min=12775683594
cpu=15 nr=0
migration/14-85 [014] d..1. 576.201195: sched_place: comm= pid=0
sev=33655559178 sed=33661025369 sel=0 avg=33655559178 min=33655559178
cpu=15 nr=0
migration/14-85 [014] d..1. 576.201195: sched_place: comm= pid=0
sev=42240572785 sed=42244083642 sel=0 avg=42240572785 min=42240572785
cpu=15 nr=0
migration/14-85 [014] d..1. 576.201196: sched_place: comm= pid=0
sev=70190876523 sed=70194789898 sel=-13068763 avg=70190876523
min=70190876523 cpu=15 nr=0
migration/14-85 [014] d.... 576.201198: sched_stat_wait:
comm=kworker/14:0 pid=6525 delay=39718472 [ns]
migration/14-85 [014] d.... 576.201198: sched_switch:
prev_comm=migration/14 prev_pid=85 prev_prio=0 prev_state=S ==>
next_comm=kworker/14:0 next_pid=6525 next_prio=120
--> only now, kworker is eligible to run again, after a delay of
39718472 ns
kworker/14:0-6525 [014] d.... 576.201200: sched_waking: comm=CPU
0/KVM pid=2947 prio=120 target_cpu=012
kworker/14:0-6525 [014] d.... 576.201290: sched_stat_runtime:
comm=kworker/14:0 pid=6525 runtime=92941 [ns] vruntime=86642284800 [ns]

#################### WAIT DELAYS - PERF LATENCY ####################
last good commit --> perf sched latency -s max

-------------------------------------------------------------------------------------------------------------------------------------------
Task                  | Runtime ms   | Switches | Avg delay ms   | Max delay ms    | Max delay start           | Max delay end           |
-------------------------------------------------------------------------------------------------------------------------------------------
CPU 2/KVM:(2)         |  5399.650 ms |   108698 | avg:  0.003 ms | max:   3.077 ms | max start:   544.090322 s | max end:   544.093399 s
CPU 7/KVM:(2)         |  5111.132 ms |    69632 | avg:  0.003 ms | max:   2.980 ms | max start:   544.690994 s | max end:   544.693974 s
kworker/22:3-ev:723   |   342.944 ms |    63417 | avg:  0.005 ms | max:   1.880 ms | max start:   545.235430 s | max end:   545.237310 s
CPU 0/KVM:(2)         |  8171.431 ms |   433099 | avg:  0.003 ms | max:   1.004 ms | max start:   547.970344 s | max end:   547.971348 s
CPU 1/KVM:(2)         |  5486.260 ms |   258702 | avg:  0.003 ms | max:   1.002 ms | max start:   548.782514 s | max end:   548.783516 s
CPU 5/KVM:(2)         |  4766.143 ms |    65727 | avg:  0.003 ms | max:   0.997 ms | max start:   545.313610 s | max end:   545.314607 s
vhost-2268:(6)        | 13206.503 ms |   315030 | avg:  0.003 ms | max:   0.989 ms | max start:   550.887761 s | max end:   550.888749 s
vhost-2892:(6)        | 14467.268 ms |   214005 | avg:  0.003 ms | max:   0.981 ms | max start:   545.213819 s | max end:   545.214800 s
CPU 3/KVM:(2)         |  5538.908 ms |    85105 | avg:  0.003 ms | max:   0.883 ms | max start:   547.138139 s | max end:   547.139023 s
CPU 6/KVM:(2)         |  5289.827 ms |    72301 | avg:  0.003 ms | max:   0.836 ms | max start:   551.094590 s | max end:   551.095425 s

6.6 rc7 --> perf sched latency -s max
-------------------------------------------------------------------------------------------------------------------------------------------
Task                  | Runtime ms   | Switches | Avg delay ms   | Max delay ms    | Max delay start           | Max delay end           |
-------------------------------------------------------------------------------------------------------------------------------------------
kworker/19:2-ev:1071  |    69.482 ms |    12700 | avg:  0.050 ms | max: 366.314 ms | max start: 54705.674294 s | max end: 54706.040607 s
kworker/13:1-ev:184   |    78.048 ms |    14645 | avg:  0.067 ms | max: 287.738 ms | max start: 54710.312863 s | max end: 54710.600602 s
kworker/12:1-ev:46148 |   138.488 ms |    26660 | avg:  0.021 ms | max: 147.414 ms | max start: 54706.133161 s | max end: 54706.280576 s
kworker/16:2-ev:33076 |   149.175 ms |    29491 | avg:  0.026 ms | max: 139.752 ms | max start: 54708.410845 s | max end: 54708.550597 s
CPU 3/KVM:(2)         |  1934.714 ms |    41896 | avg:  0.007 ms | max:  92.126 ms | max start: 54713.158498 s | max end: 54713.250624 s
kworker/7:2-eve:17001 |    68.164 ms |    11820 | avg:  0.045 ms | max:  69.717 ms | max start: 54707.100903 s | max end: 54707.170619 s
kworker/17:1-ev:46510 |    68.804 ms |    13328 | avg:  0.037 ms | max:  67.894 ms | max start: 54711.022711 s | max end: 54711.090605 s
kworker/21:1-ev:45782 |    68.906 ms |    13215 | avg:  0.021 ms | max:  59.473 ms | max start: 54709.351135 s | max end: 54709.410608 s
ksoftirqd/17:101      |     0.041 ms |        2 | avg: 25.028 ms | max:  50.047 ms | max start: 54711.040578 s | max end: 54711.090625 s

#################### TEST SUMMARY ####################
Setup description:
- single KVM host with 2 identical guests
- guests are connected virtually via Open vSwitch
- guests run uperf streaming read workload with 50 parallel connections
- one guest acts as uperf client, the other one as uperf server

Regression:
kernel-6.5.0-rc2: 78 Gb/s (before 86bfbb7ce4f6 sched/fair: Add lag based
placement)
kernel-6.5.0-rc2: 29 Gb/s (with 86bfbb7ce4f6 sched/fair: Add lag based
placement)
kernel-6.7.0-rc1: 41 Gb/s

KVM host:
- 12 dedicated IFLs, SMT-2 (24 Linux CPUs)
- 64 GiB memory
- FEDORA 38
- kernel commandline: transparent_hugepage=never audit_enable=0 audit=0
audit_debug=0 selinux=0

KVM guests:
- 8 vCPUs
- 8 GiB memory
- RHEL 9.2
- kernel: 5.14.0-162.6.1.el9_1.s390x
- kernel commandline: transparent_hugepage=never audit_enable=0 audit=0
audit_debug=0 selinux=0

Open vSwitch:
- Open vSwitch with 2 ports, each with mtu=32768 and qlen=15000
- Open vSwitch ports attached to guests via virtio-net
- each guest has 4 vhost-queues

Domain xml snippet for Open vSwitch port:
<interface type="bridge" dev="OVS">
<source bridge="vswitch0"/>
<mac address="02:bb:97:28:02:02"/>
<virtualport type="openvswitch"/>
<model type="virtio"/>
<target dev="vport1"/>
<driver name="vhost" queues="4"/>
<address type="ccw" cssid="0xfe" ssid="0x0" devno="0x0002"/>
</interface>

Benchmark: uperf
- workload: str-readx30k, 50 active parallel connections
- uperf server permanently sends data in 30720-byte chunks
- uperf client receives and acknowledges this data
- Server: uperf -s
- Client: uperf -a -i 30 -m uperf.xml

uperf.xml:
<?xml version="1.0"?>
<profile name="strburst">
<group nprocs="50">
<transaction iterations="1">
<flowop type="connect" options="remotehost=10.161.28.3 protocol=tcp "/>
</transaction>
<transaction duration="300">
<flowop type="read" options="count=640 size=30k"/>
</transaction>
<transaction iterations="1">
<flowop type="disconnect" />
</transaction>
</group>
</profile>

2023-11-17 09:26:26

by Peter Zijlstra

Subject: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

Your email is pretty badly mangled by wrapping, please try and
reconfigure your MUA, esp. the trace and debug output is unreadable.

On Thu, Nov 16, 2023 at 07:58:18PM +0100, Tobias Huschle wrote:

> The base scenario are two KVM guests running on an s390 LPAR. One guest
> hosts the uperf server, one the uperf client.
> With EEVDF we observe a regression of ~50% for a strburst test.
> For a more detailed description of the setup see the section TEST SUMMARY at
> the bottom.

Well, that's not good :/

> Short summary:
> The mentioned kworker has been scheduled to CPU 14 before the tracing was
> enabled.
> A vhost process is migrated onto CPU 14.
> The vruntimes of kworker and vhost differ significantly (86642125805 vs
> 4242563284 -> factor 20)

So bear with me, I know absolutely nothing about virt stuff. I suspect
there's cgroups involved because shiny or something.

kworkers are typically not in cgroups and are part of the root cgroup,
but what's a vhost and where does it live?

Also, what are their weights / nice values?

> The vhost process wants to wake up the kworker, therefore the kworker is
> placed onto the runqueue again and set to runnable.
> The vhost process continues to execute, waking up other vhost processes on
> other CPUs.
>
> So far this behavior is not different to what we see on pre-EEVDF kernels.
>
> On timestamp 576.162767, the vhost process triggers the last wake up of
> another vhost on another CPU.
> Until timestamp 576.171155, we see no other activity. Now, the vhost process
> ends its time slice.
> Then, vhost gets re-assigned new time slices 4 times and gets then migrated
> off to CPU 15.

So why does this vhost stay on the CPU if it doesn't have anything to
do? (I've not tried to make sense of the trace, that's just too
painful).

> This does not occur with older kernels.
> The kworker has to wait for the migration to happen in order to be able to
> execute again.
> This is due to the fact, that the vruntime of the kworker is significantly
> larger than the one of vhost.

That's, weird. Can you add a trace_printk() to update_entity_lag() and
have it print out the lag, limit and vlag (post clamping) values? And
also in place_entity() for the reverse process, lag pre and post scaling
or something.
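
For readers unfamiliar with the values named here, the clamping that
happens when an entity's lag is recorded can be modelled in user space
roughly as below. This is a simplified sketch, not the kernel source: the
struct and function names are invented for the example, the limit is
passed in directly instead of being derived from the slice and tick
length, and weights are ignored.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the fields of interest. */
struct model_entity {
	int64_t vruntime;	/* virtual runtime, ns */
	int64_t vlag;		/* clamped lag stored for the next placement */
};

static int64_t clamp_s64(int64_t v, int64_t lo, int64_t hi)
{
	if (v < lo)
		return lo;
	if (v > hi)
		return hi;
	return v;
}

/*
 * lag = avg_vruntime - vruntime; a negative value means the entity has
 * received more service than its fair share. Returns the raw lag and
 * stores the clamped value, so both can be printed side by side.
 */
int64_t model_update_lag(struct model_entity *se, int64_t avg_vruntime,
			 int64_t limit)
{
	int64_t lag = avg_vruntime - se->vruntime;

	assert(limit >= 0);
	se->vlag = clamp_s64(lag, -limit, limit);
	return lag;
}
```

Printing the raw lag, the limit and the stored vlag for each call is the
kind of output asked for above: a raw lag far outside the limit would
point at a task overshooting its slice.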

After confirming both tasks are indeed in the same cgroup of course,
because if they're not, vruntime will be meaningless to compare and we
should look elsewhere.

Also, what HZ and what preemption mode are you running? If kworker is
somehow vastly over-shooting its slice -- keeps running way past the
avg_vruntime, then it will build up a giant lag and you get what you
describe, next time it wakes up it gets placed far to the right (exactly
where it was when it 'finally' went to sleep, relatively speaking).

> We found some options which sound plausible but we are not sure if they are
> valid or not:
>
> 1. The wake up path has a dependency on the vruntime metrics that now delays
> the execution of the kworker.
> 2. The previous commit af4cf40470c2 (sched/fair: Add cfs_rq::avg_vruntime)
> which updates the way cfs_rq->min_vruntime and
> cfs_rq->avg_runtime are set might have introduced an issue which is
> uncovered with the commit mentioned above.

Suppose you have a few tasks (of equal weight) on your virtual timeline
like so:

  ---------+---+---+---+---+------
           ^       ^
           |       `avg_vruntime
           `-min_vruntime

Then the above would be more or less the relative placements of these
values. avg_vruntime is the weighted average of the various vruntimes
and is therefore always in the 'middle' of the tasks, and not somewhere
out-there.

min_vruntime is a monotonically increasing 'minimum' that's left-ish on
the tree (there's a few cases where a new task can be placed left of
min_vruntime and it's no longer actually the minimum, but whatever).

These values should be relatively close to one another, depending
of course on the spread of the tasks. So I don't think this is causing
trouble.
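
As an illustration of how the two values relate, here is a small
user-space model. It is a sketch under simplifying assumptions: the names
are hypothetical, min_vruntime is modelled as the plain leftmost vruntime
(the real value is monotonic and only 'left-ish'), and the average is
computed relative to that base to keep the intermediate sums small.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a queued task. */
struct model_task {
	int64_t vruntime;
	int64_t weight;
};

/* Leftmost vruntime on the timeline. */
int64_t model_min_vruntime(const struct model_task *t, size_t n)
{
	assert(n > 0);
	int64_t min = t[0].vruntime;

	for (size_t i = 1; i < n; i++)
		if (t[i].vruntime < min)
			min = t[i].vruntime;
	return min;
}

/* Weighted average: base + sum(w_i * (v_i - base)) / sum(w_i). */
int64_t model_avg_vruntime(const struct model_task *t, size_t n)
{
	int64_t base = model_min_vruntime(t, n);
	int64_t sum = 0, load = 0;

	for (size_t i = 0; i < n; i++) {
		sum += (t[i].vruntime - base) * t[i].weight;
		load += t[i].weight;
	}
	assert(load > 0);
	return base + sum / load;
}
```

For five equal-weight tasks at vruntimes 100, 200, 300, 400 and 500 the
model gives a minimum of 100 and an average of 300, i.e. the average sits
in the middle of the tasks as in the diagram above.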

Anyway, the big difference with lag based placement is that where
previously tasks (that do not migrate) retain their old vruntime and on
placing they get pulled forward to at least min_vruntime, so a task that
wildly overshoots, but then doesn't run for significant time can still
be overtaken and then when placed again be 'okay'.

Now OTOH, with lag-based placement, we strictly preserve their relative
offset vs avg_vruntime. So if they were *far* to the right when they go
to sleep, they will again be there on placement.

Sleeping doesn't help them anymore.
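
The difference between the two placement rules can be sketched as below.
This is a deliberately reduced model of the description above, not kernel
code: the function names are hypothetical, the old rule leaves out any
sleeper credit, and the lag-based rule leaves out the weight scaling done
on placement.

```c
#include <assert.h>
#include <stdint.h>

/* Old behaviour: keep the vruntime, but pull it up to min_vruntime. */
int64_t place_old(int64_t se_vruntime, int64_t min_vruntime)
{
	return se_vruntime > min_vruntime ? se_vruntime : min_vruntime;
}

/* Lag-based: restore the offset relative to the current average. */
int64_t place_lag(int64_t avg_vruntime, int64_t vlag)
{
	return avg_vruntime - vlag;
}

/*
 * A task overshoots and goes to sleep at vruntime 1500 while the average
 * is 1000, so its lag is -500. While it sleeps the queue advances to
 * min_vruntime 1900 and avg_vruntime 2000.
 */
void placement_example(void)
{
	int64_t vlag = 1000 - 1500;

	/* old rule: the queue has overtaken the sleeper, it lands left of avg */
	assert(place_old(1500, 1900) == 1900);
	/* lag-based rule: still 500 to the right of avg, however long it slept */
	assert(place_lag(2000, vlag) == 2500);
}
```

In this model the old rule lets the sleeper be overtaken, while the
lag-based rule puts it back exactly as far right of the average as it was
when it went to sleep.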

Now, IF this is the problem, I might have a patch that helps:

https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/commit/?h=sched/eevdf&id=119feac4fcc77001cd9bf199b25f08d232289a5c

That branch is based on v6.7-rc1 and then some, but I think it's
relatively easy to rebase the lot on v6.6 (which I'm assuming you're
on).

I'm a little conflicted on the patch, conceptually I like what it does,
but the code it turned into is quite horrible. I've tried implementing
it differently a number of times but always ended up with things that
either didn't work or were worse.

But if it works, it works I suppose.

2023-11-17 09:59:20

by Peter Zijlstra

Subject: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

On Fri, Nov 17, 2023 at 10:23:18AM +0100, Peter Zijlstra wrote:
> Now, IF this is the problem, I might have a patch that helps:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/commit/?h=sched/eevdf&id=119feac4fcc77001cd9bf199b25f08d232289a5c

And then I turn around and wipe the repository invalidating that link.

The sched/eevdf branch should be re-instated (with different SHA1), but
I'll include the patch below for reference.

---
Subject: sched/eevdf: Delay dequeue
From: Peter Zijlstra <[email protected]>
Date: Fri Sep 15 00:48:45 CEST 2023

For tasks that have negative-lag (have received 'excess' service), delay the
dequeue and keep them in the runnable tree until they're eligible again. Or
rather, keep them until they're selected again, since finding their eligibility
crossover point is expensive.

The effect is a bit like sleeper bonus, the tasks keep contending for service
until either they get a wakeup or until they're selected again and are really
dequeued.

This means that any actual dequeue happens with positive lag (service owed)
and the task is more readily run when woken next.

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
---
include/linux/sched.h | 1
kernel/sched/core.c | 88 +++++++++++++++++++++++++++++++++++++++---------
kernel/sched/fair.c | 11 ++++++
kernel/sched/features.h | 11 ++++++
kernel/sched/sched.h | 3 +
5 files changed, 97 insertions(+), 17 deletions(-)

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -916,6 +916,7 @@ struct task_struct {
 	unsigned			sched_reset_on_fork:1;
 	unsigned			sched_contributes_to_load:1;
 	unsigned			sched_migrated:1;
+	unsigned			sched_delayed:1;

 	/* Force alignment to the next boundary: */
 	unsigned			:0;
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3856,12 +3856,23 @@ static int ttwu_runnable(struct task_str

 	rq = __task_rq_lock(p, &rf);
 	if (task_on_rq_queued(p)) {
+		update_rq_clock(rq);
+		if (unlikely(p->sched_delayed)) {
+			p->sched_delayed = 0;
+			/* mustn't run a delayed task */
+			WARN_ON_ONCE(task_on_cpu(rq, p));
+			if (sched_feat(GENTLE_DELAY)) {
+				dequeue_task(rq, p, DEQUEUE_SAVE | DEQUEUE_NOCLOCK);
+				if (p->se.vlag > 0)
+					p->se.vlag = 0;
+				enqueue_task(rq, p, ENQUEUE_RESTORE | ENQUEUE_NOCLOCK);
+			}
+		}
 		if (!task_on_cpu(rq, p)) {
 			/*
 			 * When on_rq && !on_cpu the task is preempted, see if
 			 * it should preempt the task that is current now.
 			 */
-			update_rq_clock(rq);
 			wakeup_preempt(rq, p, wake_flags);
 		}
 		ttwu_do_wakeup(p);
@@ -6565,6 +6576,24 @@ pick_next_task(struct rq *rq, struct tas
 # define SM_MASK_PREEMPT	SM_PREEMPT
 #endif

+static void deschedule_task(struct rq *rq, struct task_struct *p, unsigned long prev_state)
+{
+	p->sched_contributes_to_load =
+		(prev_state & TASK_UNINTERRUPTIBLE) &&
+		!(prev_state & TASK_NOLOAD) &&
+		!(prev_state & TASK_FROZEN);
+
+	if (p->sched_contributes_to_load)
+		rq->nr_uninterruptible++;
+
+	deactivate_task(rq, p, DEQUEUE_SLEEP | DEQUEUE_NOCLOCK);
+
+	if (p->in_iowait) {
+		atomic_inc(&rq->nr_iowait);
+		delayacct_blkio_start();
+	}
+}
+
 /*
  * __schedule() is the main scheduler function.
  *
@@ -6650,6 +6679,8 @@ static void __sched notrace __schedule(u

 	switch_count = &prev->nivcsw;

+	WARN_ON_ONCE(prev->sched_delayed);
+
 	/*
 	 * We must load prev->state once (task_struct::state is volatile), such
 	 * that we form a control dependency vs deactivate_task() below.
@@ -6659,14 +6690,6 @@ static void __sched notrace __schedule(u
 		if (signal_pending_state(prev_state, prev)) {
 			WRITE_ONCE(prev->__state, TASK_RUNNING);
 		} else {
-			prev->sched_contributes_to_load =
-				(prev_state & TASK_UNINTERRUPTIBLE) &&
-				!(prev_state & TASK_NOLOAD) &&
-				!(prev_state & TASK_FROZEN);
-
-			if (prev->sched_contributes_to_load)
-				rq->nr_uninterruptible++;
-
 			/*
 			 * __schedule()			ttwu()
 			 *   prev_state = prev->state;    if (p->on_rq && ...)
@@ -6678,17 +6701,50 @@ static void __sched notrace __schedule(u
 			 *
 			 * After this, schedule() must not care about p->state any more.
 			 */
-			deactivate_task(rq, prev, DEQUEUE_SLEEP | DEQUEUE_NOCLOCK);
-
-			if (prev->in_iowait) {
-				atomic_inc(&rq->nr_iowait);
-				delayacct_blkio_start();
-			}
+			if (sched_feat(DELAY_DEQUEUE) &&
+			    prev->sched_class->delay_dequeue_task &&
+			    prev->sched_class->delay_dequeue_task(rq, prev))
+				prev->sched_delayed = 1;
+			else
+				deschedule_task(rq, prev, prev_state);
 		}
 		switch_count = &prev->nvcsw;
 	}

-	next = pick_next_task(rq, prev, &rf);
+	for (struct task_struct *tmp = prev;;) {
+		unsigned long tmp_state;
+
+		next = pick_next_task(rq, tmp, &rf);
+		if (unlikely(tmp != prev))
+			finish_task(tmp);
+
+		if (likely(!next->sched_delayed))
+			break;
+
+		next->sched_delayed = 0;
+
+		/*
+		 * A sched_delayed task must not be runnable at this point, see
+		 * ttwu_runnable().
+		 */
+		tmp_state = READ_ONCE(next->__state);
+		if (WARN_ON_ONCE(!tmp_state))
+			break;
+
+		prepare_task(next);
+		/*
+		 * Order ->on_cpu and ->on_rq, see the comments in
+		 * try_to_wake_up(). Normally this is smp_mb__after_spinlock()
+		 * above.
+		 */
+		smp_wmb();
+		deschedule_task(rq, next, tmp_state);
+		if (sched_feat(GENTLE_DELAY) && next->se.vlag > 0)
+			next->se.vlag = 0;
+
+		tmp = next;
+	}
+
 	clear_tsk_need_resched(prev);
 	clear_preempt_need_resched();
 #ifdef CONFIG_SCHED_DEBUG
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8540,6 +8540,16 @@ static struct task_struct *__pick_next_t
 	return pick_next_task_fair(rq, NULL, NULL);
 }

+static bool delay_dequeue_task_fair(struct rq *rq, struct task_struct *p)
+{
+	struct sched_entity *se = &p->se;
+	struct cfs_rq *cfs_rq = cfs_rq_of(se);
+
+	update_curr(cfs_rq);
+
+	return !entity_eligible(cfs_rq, se);
+}
+
 /*
  * Account for a descheduled task:
  */
@@ -13151,6 +13161,7 @@ DEFINE_SCHED_CLASS(fair) = {

 	.wakeup_preempt		= check_preempt_wakeup_fair,

+	.delay_dequeue_task	= delay_dequeue_task_fair,
 	.pick_next_task		= __pick_next_task_fair,
 	.put_prev_task		= put_prev_task_fair,
 	.set_next_task		= set_next_task_fair,
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -24,6 +24,17 @@ SCHED_FEAT(PREEMPT_SHORT, true)
  */
 SCHED_FEAT(PLACE_SLEEPER, false)
 SCHED_FEAT(GENTLE_SLEEPER, true)
+/*
+ * Delay dequeueing tasks until they get selected or woken.
+ *
+ * By delaying the dequeue for non-eligible tasks, they remain in the
+ * competition and can burn off their negative lag. When they get selected
+ * they'll have positive lag by definition.
+ *
+ * GENTLE_DELAY clips the lag on dequeue (or wakeup) to 0.
+ */
+SCHED_FEAT(DELAY_DEQUEUE, true)
+SCHED_FEAT(GENTLE_DELAY, true)

 /*
  * Prefer to schedule the task we woke last (assuming it failed
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2254,6 +2254,7 @@ struct sched_class {

 	void (*wakeup_preempt)(struct rq *rq, struct task_struct *p, int flags);

+	bool (*delay_dequeue_task)(struct rq *rq, struct task_struct *p);
 	struct task_struct *(*pick_next_task)(struct rq *rq);

 	void (*put_prev_task)(struct rq *rq, struct task_struct *p);
@@ -2307,7 +2308,7 @@ struct sched_class {

 static inline void put_prev_task(struct rq *rq, struct task_struct *prev)
 {
-	WARN_ON_ONCE(rq->curr != prev);
+//	WARN_ON_ONCE(rq->curr != prev);
 	prev->sched_class->put_prev_task(rq, prev);
 }
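
The decision the patch makes when a task goes to sleep can be summarised
with a small user-space model. It is only a sketch of the idea described
in the changelog: all names are hypothetical, eligibility is reduced to
"vruntime not beyond the average" (i.e. non-negative lag), and locking,
weights and the scheduler-class indirection are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum sleep_action { DEQUEUE_NOW, KEEP_DELAYED };

/* Eligible means non-negative lag: vruntime <= avg_vruntime. */
bool model_entity_eligible(int64_t avg_vruntime, int64_t vruntime)
{
	return vruntime <= avg_vruntime;
}

/* With the feature on, a non-eligible sleeper stays in the tree. */
enum sleep_action model_on_sleep(bool delay_dequeue, int64_t avg_vruntime,
				 int64_t vruntime)
{
	if (delay_dequeue && !model_entity_eligible(avg_vruntime, vruntime))
		return KEEP_DELAYED;
	return DEQUEUE_NOW;
}

/* GENTLE_DELAY: positive lag is clipped to 0 on the real dequeue or wakeup. */
int64_t model_gentle_clip(int64_t vlag)
{
	return vlag > 0 ? 0 : vlag;
}

void delay_dequeue_example(void)
{
	/* task is 500 past the average: keep it queued to burn off the lag */
	assert(model_on_sleep(true, 1000, 1500) == KEEP_DELAYED);
	/* task is owed service: dequeue right away */
	assert(model_on_sleep(true, 1000, 900) == DEQUEUE_NOW);
	/* feature disabled: always dequeue */
	assert(model_on_sleep(false, 1000, 1500) == DEQUEUE_NOW);
	assert(model_gentle_clip(250) == 0);
}
```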

2023-11-17 12:38:46

by Peter Zijlstra

Subject: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

On Fri, Nov 17, 2023 at 01:24:21PM +0100, Tobias Huschle wrote:
> On Fri, Nov 17, 2023 at 10:23:18AM +0100, Peter Zijlstra wrote:

> > kworkers are typically not in cgroups and are part of the root cgroup,
> > but what's a vhost and where does it live?
>
> The qemu instances of the two KVM guests are placed into cgroups.
> The vhosts run within the context of these qemu instances (4 threads per guest).
> So they are also put into those cgroups.
>
> I'll answer the other questions you brought up as well, but I guess that one
> is most critical:
>
> >
> > After confirming both tasks are indeed in the same cgroup ofcourse,
> > because if they're not, vruntime will be meaningless to compare and we
> > should look elsewhere.
>
> In that case we probably have to go with elsewhere ... which is good to know.

Ah, so if this is a cgroup issue, it might be worth trying this patch
that we have in tip/sched/urgent.

I'll try and read the rest of the email a little later, gotta run
errands first.

---

commit eab03c23c2a162085b13200d7942fc5a00b5ccc8
Author: Abel Wu <[email protected]>
Date: Tue Nov 7 17:05:07 2023 +0800

sched/eevdf: Fix vruntime adjustment on reweight

vruntime of the (on_rq && !0-lag) entity needs to be adjusted when
it gets re-weighted, and the calculations can be simplified based
on the fact that re-weight won't change the w-average of all the
entities. Please check the proofs in comments.

But adjusting vruntime can also cause position change in RB-tree
hence require re-queue to fix up which might be costly. This might
be avoided by deferring adjustment to the time the entity actually
leaves tree (dequeue/pick), but that will negatively affect task
selection and probably not good enough either.

Fixes: 147f3efaa241 ("sched/fair: Implement an EEVDF-like scheduling policy")
Signed-off-by: Abel Wu <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
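
The vruntime adjustment derived in the comments of this patch, Eq. (4)
combined with corollary #2 (V' = V), reduces to v' = V - (V - v)*w/w'.
A small user-space sketch of that formula follows. The function name is
hypothetical and plain integer division stands in for the kernel helpers.

```c
#include <assert.h>
#include <stdint.h>

/*
 * V: weighted average vruntime of the queue (unchanged by the re-weight)
 * v: current vruntime of the entity
 * w_old, w_new: weight before and after the re-weight
 *
 * Returns v' such that the lag (V - v)*w is preserved.
 */
int64_t model_reweight_vruntime(int64_t V, int64_t v, int64_t w_old,
				int64_t w_new)
{
	int64_t vlag = V - v;

	assert(w_new > 0);
	return V - vlag * w_old / w_new;
}
```

For example, with V = 1000, v = 400 and the weight doubling from 1024 to
2048, the result is v' = 700, and the lag is the same before and after:
(1000 - 400) * 1024 = (1000 - 700) * 2048 = 614400.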

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2048138ce54b..025d90925bf6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3666,41 +3666,140 @@ static inline void
 dequeue_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) { }
 #endif

+static void reweight_eevdf(struct cfs_rq *cfs_rq, struct sched_entity *se,
+			   unsigned long weight)
+{
+	unsigned long old_weight = se->load.weight;
+	u64 avruntime = avg_vruntime(cfs_rq);
+	s64 vlag, vslice;
+
+	/*
+	 * VRUNTIME
+	 * ========
+	 *
+	 * COROLLARY #1: The virtual runtime of the entity needs to be
+	 * adjusted if re-weight at !0-lag point.
+	 *
+	 * Proof: For contradiction assume this is not true, so we can
+	 * re-weight without changing vruntime at !0-lag point.
+	 *
+	 *             Weight    VRuntime   Avg-VRuntime
+	 *     before    w          v            V
+	 *      after    w'         v'           V'
+	 *
+	 * Since lag needs to be preserved through re-weight:
+	 *
+	 *     lag = (V - v)*w = (V'- v')*w', where v = v'
+	 *     ==>     V' = (V - v)*w/w' + v             (1)
+	 *
+	 * Let W be the total weight of the entities before reweight,
+	 * since V' is the new weighted average of entities:
+	 *
+	 *     V' = (WV + w'v - wv) / (W + w' - w)       (2)
+	 *
+	 * by using (1) & (2) we obtain:
+	 *
+	 *     (WV + w'v - wv) / (W + w' - w) = (V - v)*w/w' + v
+	 *     ==> (WV-Wv+Wv+w'v-wv)/(W+w'-w) = (V - v)*w/w' + v
+	 *     ==> (WV - Wv)/(W + w' - w) + v = (V - v)*w/w' + v
+	 *     ==>     (V - v)*W/(W + w' - w) = (V - v)*w/w' (3)
+	 *
+	 * Since we are doing at !0-lag point which means V != v, we
+	 * can simplify (3):
+	 *
+	 *     ==>     W / (W + w' - w) = w / w'
+	 *     ==>     Ww' = Ww + ww' - ww
+	 *     ==>     W * (w' - w) = w * (w' - w)
+	 *     ==>     W = w   (re-weight indicates w' != w)
+	 *
+	 * So the cfs_rq contains only one entity, hence vruntime of
+	 * the entity @v should always equal to the cfs_rq's weighted
+	 * average vruntime @V, which means we will always re-weight
+	 * at 0-lag point, thus breach assumption. Proof completed.
+	 *
+	 *
+	 * COROLLARY #2: Re-weight does NOT affect weighted average
+	 * vruntime of all the entities.
+	 *
+	 * Proof: According to corollary #1, Eq. (1) should be:
+	 *
+	 *     (V - v)*w = (V' - v')*w'
+	 *     ==>     v' = V' - (V - v)*w/w'            (4)
+	 *
+	 * According to the weighted average formula, we have:
+	 *
+	 *     V' = (WV - wv + w'v') / (W - w + w')
+	 *        = (WV - wv + w'(V' - (V - v)w/w')) / (W - w + w')
+	 *        = (WV - wv + w'V' - Vw + wv) / (W - w + w')
+	 *        = (WV + w'V' - Vw) / (W - w + w')
+	 *
+	 *     ==>  V'*(W - w + w') = WV + w'V' - Vw
+	 *     ==>     V' * (W - w) = (W - w) * V        (5)
+	 *
+	 * If the entity is the only one in the cfs_rq, then reweight
+	 * always occurs at 0-lag point, so V won't change. Or else
+	 * there are other entities, hence W != w, then Eq. (5) turns
+ * into V' = V. So V won't change in either case, proof done.
+ *
+ *
+ * So according to corollary #1 & #2, the effect of re-weight
+ * on vruntime should be:
+ *
+ * v' = V' - (V - v) * w / w' (4)
+ * = V - (V - v) * w / w'
+ * = V - vl * w / w'
+ * = V - vl'
+ */
+ if (avruntime != se->vruntime) {
+ vlag = (s64)(avruntime - se->vruntime);
+ vlag = div_s64(vlag * old_weight, weight);
+ se->vruntime = avruntime - vlag;
+ }
+
+ /*
+ * DEADLINE
+ * ========
+ *
+ * When the weight changes, the virtual time slope changes and
+ * we should adjust the relative virtual deadline accordingly.
+ *
+ * d' = v' + (d - v)*w/w'
+ * = V' - (V - v)*w/w' + (d - v)*w/w'
+ * = V - (V - v)*w/w' + (d - v)*w/w'
+ * = V + (d - V)*w/w'
+ */
+ vslice = (s64)(se->deadline - avruntime);
+ vslice = div_s64(vslice * old_weight, weight);
+ se->deadline = avruntime + vslice;
+}
+
static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
unsigned long weight)
{
- unsigned long old_weight = se->load.weight;
+ bool curr = cfs_rq->curr == se;

if (se->on_rq) {
/* commit outstanding execution time */
- if (cfs_rq->curr == se)
+ if (curr)
update_curr(cfs_rq);
else
- avg_vruntime_sub(cfs_rq, se);
+ __dequeue_entity(cfs_rq, se);
update_load_sub(&cfs_rq->load, se->load.weight);
}
dequeue_load_avg(cfs_rq, se);

- update_load_set(&se->load, weight);
-
if (!se->on_rq) {
/*
* Because we keep se->vlag = V - v_i, while: lag_i = w_i*(V - v_i),
* we need to scale se->vlag when w_i changes.
*/
- se->vlag = div_s64(se->vlag * old_weight, weight);
+ se->vlag = div_s64(se->vlag * se->load.weight, weight);
} else {
- s64 deadline = se->deadline - se->vruntime;
- /*
- * When the weight changes, the virtual time slope changes and
- * we should adjust the relative virtual deadline accordingly.
- */
- deadline = div_s64(deadline * old_weight, weight);
- se->deadline = se->vruntime + deadline;
- if (se != cfs_rq->curr)
- min_deadline_cb_propagate(&se->run_node, NULL);
+ reweight_eevdf(cfs_rq, se, weight);
}

+ update_load_set(&se->load, weight);
+
#ifdef CONFIG_SMP
do {
u32 divider = get_pelt_divider(&se->avg);
@@ -3712,8 +3811,17 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
enqueue_load_avg(cfs_rq, se);
if (se->on_rq) {
update_load_add(&cfs_rq->load, se->load.weight);
- if (cfs_rq->curr != se)
- avg_vruntime_add(cfs_rq, se);
+ if (!curr) {
+ /*
+ * The entity's vruntime has been adjusted, so let's check
+ * whether the rq-wide min_vruntime needs updated too. Since
+ * the calculations above require stable min_vruntime rather
+ * than up-to-date one, we do the update at the end of the
+ * reweight process.
+ */
+ __enqueue_entity(cfs_rq, se);
+ update_min_vruntime(cfs_rq);
+ }
}
}

@@ -3857,14 +3965,11 @@ static void update_cfs_group(struct sched_entity *se)

#ifndef CONFIG_SMP
shares = READ_ONCE(gcfs_rq->tg->shares);
-
- if (likely(se->load.weight == shares))
- return;
#else
- shares = calc_group_shares(gcfs_rq);
+ shares = calc_group_shares(gcfs_rq);
#endif
-
- reweight_entity(cfs_rq_of(se), se, shares);
+ if (unlikely(se->load.weight != shares))
+ reweight_entity(cfs_rq_of(se), se, shares);
}

#else /* CONFIG_FAIR_GROUP_SCHED */
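
The two formulas at the end of the comments in reweight_eevdf() can be checked with a small userspace model. This is only a sketch: the toy_* names are hypothetical, vruntimes are plain absolute integers (the kernel tracks them relative to min_vruntime), and ordinary C division stands in for div_s64().

```c
#include <stdint.h>

/* Hypothetical stand-in for the fields of a sched_entity used here. */
struct toy_entity {
	int64_t vruntime;	/* v */
	int64_t deadline;	/* d */
	int64_t weight;		/* w */
};

/* Weighted average vruntime V of n entities (0 for an empty queue). */
static int64_t toy_avg_vruntime(const struct toy_entity *es, int n)
{
	int64_t sum = 0, load = 0;

	for (int i = 0; i < n; i++) {
		sum += es[i].weight * es[i].vruntime;
		load += es[i].weight;
	}
	return load ? sum / load : 0;
}

/* Weighted lag w * (V - v), the quantity a re-weight has to preserve. */
static int64_t toy_lag(const struct toy_entity *e, int64_t V)
{
	return e->weight * (V - e->vruntime);
}

/*
 * Rescale vruntime and deadline around V for a weight change w -> w':
 *   v' = V - (V - v) * w / w'
 *   d' = V + (d - V) * w / w'
 */
static void toy_reweight(struct toy_entity *e, int64_t V, int64_t new_weight)
{
	int64_t vlag = (V - e->vruntime) * e->weight / new_weight;
	int64_t vslice = (e->deadline - V) * e->weight / new_weight;

	e->vruntime = V - vlag;
	e->deadline = V + vslice;
	e->weight = new_weight;
}
```

With two entities of weight 1024 at vruntime 1000 and 3000, V is 2000. Doubling the weight of the first one moves it to vruntime 1500, keeps its weighted lag at 1024000, and leaves V at 2000, which is what corollaries #1 and #2 predict for these numbers.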

2023-11-17 13:09:17

by Abel Wu

Subject: Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

On 11/17/23 8:37 PM, Peter Zijlstra Wrote:
> On Fri, Nov 17, 2023 at 01:24:21PM +0100, Tobias Huschle wrote:
>> On Fri, Nov 17, 2023 at 10:23:18AM +0100, Peter Zijlstra wrote:
>
>>> kworkers are typically not in cgroups and are part of the root cgroup,
>>> but what's a vhost and where does it live?
>>
>> The qemu instances of the two KVM guests are placed into cgroups.
>> The vhosts run within the context of these qemu instances (4 threads per guest).
>> So they are also put into those cgroups.
>>
>> I'll answer the other questions you brought up as well, but I guess that one
>> is most critical:
>>
>>>
>>> After confirming both tasks are indeed in the same cgroup of course,
>>> because if they're not, vruntime will be meaningless to compare and we
>>> should look elsewhere.
>>
>> In that case we probably have to go with elsewhere ... which is good to know.
>
> Ah, so if this is a cgroup issue, it might be worth trying this patch
> that we have in tip/sched/urgent.

And please also apply this fix:
https://lore.kernel.org/all/[email protected]/

>
> I'll try and read the rest of the email a little later, gotta run
> errands first.

2023-11-18 05:19:21

by Abel Wu

Subject: Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

On 11/17/23 5:23 PM, Peter Zijlstra Wrote:
>
> Your email is pretty badly mangled by wrapping, please try and
> reconfigure your MUA, esp. the trace and debug output is unreadable.
>
> On Thu, Nov 16, 2023 at 07:58:18PM +0100, Tobias Huschle wrote:
>
>> The base scenario are two KVM guests running on an s390 LPAR. One guest
>> hosts the uperf server, one the uperf client.
>> With EEVDF we observe a regression of ~50% for a strburst test.
>> For a more detailed description of the setup see the section TEST SUMMARY at
>> the bottom.
>
> Well, that's not good :/
>
>> Short summary:
>> The mentioned kworker has been scheduled to CPU 14 before the tracing was
>> enabled.
>> A vhost process is migrated onto CPU 14.
>> The vruntimes of kworker and vhost differ significantly (86642125805 vs
>> 4242563284 -> factor 20)
>
> So bear with me, I know absolutely nothing about virt stuff. I suspect
> there's cgroups involved because shiny or something.
>
> kworkers are typically not in cgroups and are part of the root cgroup,
> but what's a vhost and where does it live?
>
> Also, what are their weights / nice values?
>
>> The vhost process wants to wake up the kworker, therefore the kworker is
>> placed onto the runqueue again and set to runnable.
>> The vhost process continues to execute, waking up other vhost processes on
>> other CPUs.
>>
>> So far this behavior is not different to what we see on pre-EEVDF kernels.
>>
>> On timestamp 576.162767, the vhost process triggers the last wake up of
>> another vhost on another CPU.
>> Until timestamp 576.171155, we see no other activity. Now, the vhost process
>> ends its time slice.
>> Then, vhost gets re-assigned new time slices 4 times and gets then migrated
>> off to CPU 15.
>
> So why does this vhost stay on the CPU if it doesn't have anything to
> do? (I've not tried to make sense of the trace, that's just too
> painful).
>
>> This does not occur with older kernels.
>> The kworker has to wait for the migration to happen in order to be able to
>> execute again.
>> This is due to the fact, that the vruntime of the kworker is significantly
>> larger than the one of vhost.
>
> That's, weird. Can you add a trace_printk() to update_entity_lag() and
> have it print out the lag, limit and vlag (post clamping) values? And
> also in place_entity() for the reverse process, lag pre and post scaling
> or something.
>
> After confirming both tasks are indeed in the same cgroup of course,
> because if they're not, vruntime will be meaningless to compare and we
> should look elsewhere.
>
> Also, what HZ and what preemption mode are you running? If kworker is
> somehow vastly over-shooting its slice -- keeps running way past the
> avg_vruntime, then it will build up a giant lag and you get what you
> describe, next time it wakes up it gets placed far to the right (exactly
> where it was when it 'finally' went to sleep, relatively speaking).
>
>> We found some options which sound plausible but we are not sure if they are
>> valid or not:
>>
>> 1. The wake up path has a dependency on the vruntime metrics that now delays
>> the execution of the kworker.
>> 2. The previous commit af4cf40470c2 (sched/fair: Add cfs_rq::avg_vruntime)
>> which updates the way cfs_rq->min_vruntime and
>> cfs_rq->avg_runtime are set might have introduced an issue which is
>> uncovered with the commit mentioned above.
>
> Suppose you have a few tasks (of equal weight) on your virtual timeline
> like so:
>
>   ---------+---+---+---+---+------
>            ^       ^
>            |       `avg_vruntime
>            `-min_vruntime
>
> Then the above would be more or less the relative placements of these
> values. avg_vruntime is the weighted average of the various vruntimes
> and is therefore always in the 'middle' of the tasks, and not somewhere
> out-there.
>
> min_vruntime is a monotonically increasing 'minimum' that's left-ish on
> the tree (there's a few cases where a new task can be placed left of
> min_vruntime and it's no longer actually the minimum, but whatever).
>
> These values should be relatively close to one another, depending
> of course on the spread of the tasks. So I don't think this is causing
> trouble.
>
> Anyway, the big difference with lag based placement is that where
> previously tasks (that do not migrate) retain their old vruntime and on
> placing they get pulled forward to at least min_vruntime, so a task that
> wildly overshoots, but then doesn't run for significant time can still
> be overtaken and then when placed again be 'okay'.
>
> Now OTOH, with lag-based placement, we strictly preserve their relative
> offset vs avg_vruntime. So if they were *far* to the right when they go
> to sleep, they will again be there on placement.
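
The difference between the two placement policies described in the quoted text can be made concrete with a toy model. This is only a sketch under simplifying assumptions: the toy_* helpers are hypothetical, the old policy is reduced to max(old vruntime, min_vruntime) without any sleeper credit, and the lag-based policy is reduced to v = V - vl without clamping or inflation.

```c
#include <stdint.h>

/*
 * Simplified pre-EEVDF placement: the waking task keeps its old
 * vruntime but is pulled forward to at least min_vruntime.
 */
static int64_t toy_place_min(int64_t old_vruntime, int64_t min_vruntime)
{
	return old_vruntime > min_vruntime ? old_vruntime : min_vruntime;
}

/*
 * Simplified lag-based placement: the offset relative to the average,
 * vlag = V - v as recorded at dequeue, is restored against the
 * current average.
 */
static int64_t toy_place_lag(int64_t avg_vruntime, int64_t vlag)
{
	return avg_vruntime - vlag;
}
```

Take a task that went to sleep at vruntime 10000 while the average was 4000, so vlag = -6000. If the queue has since advanced to min_vruntime 12000 and average 13000, the old policy places the task at 12000, left of the average, while the lag-based policy places it at 19000, again 6000 to the right of the average.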

Hi Peter, I'm a little confused here. As we adopt placement strategy #1
when PLACE_LAG is enabled, the lag of that entity needs to be preserved.
Given that the weight doesn't change, we have:

vl' = vl

But in fact it is scaled on placement:

vl' = vl * W/(W + w)

Is this intended? And to illustrate my understanding of strategy #1:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 07f555857698..a24ef8b297ed 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5131,7 +5131,7 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
*
* EEVDF: placement strategy #1 / #2
*/
- if (sched_feat(PLACE_LAG) && cfs_rq->nr_running) {
+ if (sched_feat(PLACE_LAG) && cfs_rq->nr_running && se->vlag) {
struct sched_entity *curr = cfs_rq->curr;
unsigned long load;

@@ -5150,7 +5150,10 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
* To avoid the 'w_i' term all over the place, we only track
* the virtual lag:
*
- * vl_i = V - v_i <=> v_i = V - vl_i
+ * vl_i = V' - v_i <=> v_i = V' - vl_i
+ *
+ * Where V' is the new weighted average after placing this
+ * entity, and v_i is its newly assigned vruntime.
*
* And we take V to be the weighted average of all v:
*
@@ -5162,41 +5165,17 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
* vl_i is given by:
*
* V' = (\Sum w_j*v_j + w_i*v_i) / (W + w_i)
- * = (W*V + w_i*(V - vl_i)) / (W + w_i)
- * = (W*V + w_i*V - w_i*vl_i) / (W + w_i)
- * = (V*(W + w_i) - w_i*l) / (W + w_i)
- * = V - w_i*vl_i / (W + w_i)
- *
- * And the actual lag after adding an entity with vl_i is:
- *
- * vl'_i = V' - v_i
- * = V - w_i*vl_i / (W + w_i) - (V - vl_i)
- * = vl_i - w_i*vl_i / (W + w_i)
- *
- * Which is strictly less than vl_i. So in order to preserve lag
- * we should inflate the lag before placement such that the
- * effective lag after placement comes out right.
- *
- * As such, invert the above relation for vl'_i to get the vl_i
- * we need to use such that the lag after placement is the lag
- * we computed before dequeue.
+ * = (W*V + w_i*(V' - vl_i)) / (W + w_i)
+ * = V - w_i*vl_i / W
*
- * vl'_i = vl_i - w_i*vl_i / (W + w_i)
- * = ((W + w_i)*vl_i - w_i*vl_i) / (W + w_i)
- *
- * (W + w_i)*vl'_i = (W + w_i)*vl_i - w_i*vl_i
- * = W*vl_i
- *
- * vl_i = (W + w_i)*vl'_i / W
*/
load = cfs_rq->avg_load;
if (curr && curr->on_rq)
load += scale_load_down(curr->load.weight);
-
- lag *= load + scale_load_down(se->load.weight);
if (WARN_ON_ONCE(!load))
load = 1;
- lag = div_s64(lag, load);
+
+ vruntime -= div_s64(lag * scale_load_down(se->load.weight), load);
}

se->vruntime = vruntime - lag;
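
The arithmetic discussed above can be tried out in userspace. This is a sketch with hypothetical toy_* helpers and plain integers, not kernel code. It models three things: the effective lag when an entity is simply placed at V - vl, the inflation step from the comment that the diff removes, and the shifted-average placement that the diff adds.

```c
#include <stdint.h>

/* New weighted average after adding (w, v) to a queue with load W, average V. */
static int64_t toy_avg_after_add(int64_t W, int64_t V, int64_t w, int64_t v)
{
	return (W * V + w * v) / (W + w);
}

/* Effective virtual lag V' - v when the entity is placed at v = V - vl. */
static int64_t toy_effective_lag(int64_t W, int64_t V, int64_t w, int64_t vl)
{
	int64_t v = V - vl;

	return toy_avg_after_add(W, V, w, v) - v;
}

/* Inflation from the removed comment: vl_i = (W + w) * vl' / W. */
static int64_t toy_inflate_lag(int64_t W, int64_t w, int64_t vl)
{
	return (W + w) * vl / W;
}

/* Placement via inflation: v = V - inflated lag. */
static int64_t toy_place_inflated(int64_t W, int64_t V, int64_t w, int64_t vl)
{
	return V - toy_inflate_lag(W, w, vl);
}

/* Placement as in the diff: V' = V - w * vl / W, then v = V' - vl. */
static int64_t toy_place_shifted(int64_t W, int64_t V, int64_t w, int64_t vl)
{
	int64_t new_avg = V - w * vl / W;

	return new_avg - vl;
}
```

For W = 3072, w = 1024 and V = 10000, placing with vl = 400 yields an effective lag of 300, which matches vl * W/(W + w). Inflating a wanted lag of 300 gives 400, so the inflated placement ends up with an effective lag of 300. In this example both placement routes return 9600; algebraically V - vl - w*vl/W equals V - (W + w)*vl/W, so any difference between them would come only from integer rounding.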

2023-11-18 07:47:13

by Abel Wu

Subject: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

On 11/17/23 2:58 AM, Tobias Huschle Wrote:
> #################### TRACE EXCERPT ####################
> The sched_place trace event was added to the end of the place_entity function and outputs:
> sev -> sched_entity vruntime
> sed -> sched_entity deadline
> sel -> sched_entity vlag
> avg -> cfs_rq avg_vruntime
> min -> cfs_rq min_vruntime
> cpu -> cpu of cfs_rq
> nr -> cfs_rq nr_running
> ---
>     CPU 3/KVM-2950    [014] d....   576.161432: sched_migrate_task: comm=vhost-2920 pid=2941 prio=120 orig_cpu=15 dest_cpu=14
> --> migrates task from cpu 15 to 14
>     CPU 3/KVM-2950    [014] d....   576.161433: sched_place: comm=vhost-2920 pid=2941 sev=4242563284 sed=4245563284 sel=0 avg=4242563284 min=4242563284 cpu=14 nr=0
> --> places vhost 2920 on CPU 14 with vruntime 4242563284
>     CPU 3/KVM-2950    [014] d....   576.161433: sched_place: comm= pid=0 sev=16329848593 sed=16334604010 sel=0 avg=16329848593 min=16329848593 cpu=14 nr=0
>     CPU 3/KVM-2950    [014] d....   576.161433: sched_place: comm= pid=0 sev=42560661157 sed=42627443765 sel=0 avg=42560661157 min=42560661157 cpu=14 nr=0
>     CPU 3/KVM-2950    [014] d....   576.161434: sched_place: comm= pid=0 sev=53846627372 sed=54125900099 sel=0 avg=53846627372 min=53846627372 cpu=14 nr=0
>     CPU 3/KVM-2950    [014] d....   576.161434: sched_place: comm= pid=0 sev=86640641980 sed=87255041979 sel=0 avg=86640641980 min=86640641980 cpu=14 nr=0

The following 2 lines indicate that vhost-2920 is on_rq and so can be
picked as next, thus its cfs_rq must have at least one entity.

The above 4 lines, however, show nr=0, so the "comm= pid=0" task(s) can't
be in the same cgroup as vhost-2920.

Say vhost is in cgroupA, and "comm= pid=0" task with sev=86640641980
is in cgroupB ...

>     CPU 3/KVM-2950    [014] dN...   576.161434: sched_stat_wait: comm=vhost-2920 pid=2941 delay=9958 [ns]
>     CPU 3/KVM-2950    [014] d....   576.161435: sched_switch: prev_comm=CPU 3/KVM prev_pid=2950 prev_prio=120 prev_state=S ==> next_comm=vhost-2920 next_pid=2941 next_prio=120
>    vhost-2920-2941    [014] D....   576.161439: sched_waking: comm=vhost-2286 pid=2309 prio=120 target_cpu=008
>    vhost-2920-2941    [014] d....   576.161446: sched_waking: comm=kworker/14:0 pid=6525 prio=120 target_cpu=014
>    vhost-2920-2941    [014] d....   576.161447: sched_place: comm=kworker/14:0 pid=6525 sev=86642125805 sed=86645125805 sel=0 avg=86642125805 min=86642125805 cpu=14 nr=1
> --> places kworker 6525 on cpu 14 with vruntime 86642125805
> --> which is far larger than vhost vruntime of 4242563284

Here nr=1 means there is another entity in the same cfs_rq as the
newly woken kworker, but which? According to the vruntime, I would
assume kworker is in cgroupB.

2023-11-18 15:33:35

by Honglei Wang

Subject: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

On 2023/11/18 15:33, Abel Wu wrote:
> On 11/17/23 2:58 AM, Tobias Huschle Wrote:
>> #################### TRACE EXCERPT ####################
>> The sched_place trace event was added to the end of the place_entity
>> function and outputs:
>> sev -> sched_entity vruntime
>> sed -> sched_entity deadline
>> sel -> sched_entity vlag
>> avg -> cfs_rq avg_vruntime
>> min -> cfs_rq min_vruntime
>> cpu -> cpu of cfs_rq
>> nr -> cfs_rq nr_running
>> ---
>>      CPU 3/KVM-2950    [014] d....   576.161432: sched_migrate_task:
>> comm=vhost-2920 pid=2941 prio=120 orig_cpu=15 dest_cpu=14
>> --> migrates task from cpu 15 to 14
>>      CPU 3/KVM-2950    [014] d....   576.161433: sched_place:
>> comm=vhost-2920 pid=2941 sev=4242563284 sed=4245563284 sel=0
>> avg=4242563284 min=4242563284 cpu=14 nr=0
>> --> places vhost 2920 on CPU 14 with vruntime 4242563284
>>      CPU 3/KVM-2950    [014] d....   576.161433: sched_place: comm=
>> pid=0 sev=16329848593 sed=16334604010 sel=0 avg=16329848593
>> min=16329848593 cpu=14 nr=0
>>      CPU 3/KVM-2950    [014] d....   576.161433: sched_place: comm=
>> pid=0 sev=42560661157 sed=42627443765 sel=0 avg=42560661157
>> min=42560661157 cpu=14 nr=0
>>      CPU 3/KVM-2950    [014] d....   576.161434: sched_place: comm=
>> pid=0 sev=53846627372 sed=54125900099 sel=0 avg=53846627372
>> min=53846627372 cpu=14 nr=0
>>      CPU 3/KVM-2950    [014] d....   576.161434: sched_place: comm=
>> pid=0 sev=86640641980 sed=87255041979 sel=0 avg=86640641980
>> min=86640641980 cpu=14 nr=0
>
> As the following 2 lines indicates that vhost-2920 is on_rq so can be
> picked as next, thus its cfs_rq must have at least one entity.
>
> While the above 4 lines shows nr=0, so the "comm= pid=0" task(s) can't
> be in the same cgroup with vhost-2920.
>
> Say vhost is in cgroupA, and "comm= pid=0" task with sev=86640641980
> is in cgroupB ...
>
This looks like hierarchy enqueue stuff. The temporary trace could get
the comm and pid of vhost-2920, but failed for the other 4. I think the
reason is that they were just sched entities (se), not tasks. It seems
they came from the for_each_sched_entity(se) loop when enqueueing
vhost-2920. The last one, with cfs_rq vruntime=86640641980, might be the
root cgroup, which is on the same level as the kworkers.

So just from this tiny part of the trace log, there won't be a
difference on the order of thousands of ms. Actually, it might be only
86642125805-86640641980 = 1483825 ns, i.e. about 1.5 ms.

Correct me if there is anything wrong.

Thanks,
Honglei
>>      CPU 3/KVM-2950    [014] dN...   576.161434: sched_stat_wait:
>> comm=vhost-2920 pid=2941 delay=9958 [ns]
>>      CPU 3/KVM-2950    [014] d....   576.161435: sched_switch:
>> prev_comm=CPU 3/KVM prev_pid=2950 prev_prio=120 prev_state=S ==>
>> next_comm=vhost-2920 next_pid=2941 next_prio=120
>>     vhost-2920-2941    [014] D....   576.161439: sched_waking:
>> comm=vhost-2286 pid=2309 prio=120 target_cpu=008
>>     vhost-2920-2941    [014] d....   576.161446: sched_waking:
>> comm=kworker/14:0 pid=6525 prio=120 target_cpu=014
>>     vhost-2920-2941    [014] d....   576.161447: sched_place:
>> comm=kworker/14:0 pid=6525 sev=86642125805 sed=86645125805 sel=0
>> avg=86642125805 min=86642125805 cpu=14 nr=1
>> --> places kworker 6525 on cpu 14 with vruntime 86642125805
>> --> which is far larger than vhost vruntime of 4242563284
>
> Here nr=1 means there is another entity in the same cfs_rq with the
> newly woken kworker, but which? According to the vruntime, I would
> assume kworker is in cgroupB.
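
The hierarchy interpretation above can be sketched with a toy model. The toy_* names are hypothetical, and the loop only mirrors the shape of for_each_sched_entity() with group scheduling enabled, where each entity points to its parent group entity. Whether the four nameless sched_place lines really are group entities is the hypothesis under discussion, not something the code proves.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy scheduling entity: a task or a group, linked to its parent group. */
struct toy_se {
	const char *comm;	/* NULL for group entities (no comm/pid) */
	struct toy_se *parent;	/* NULL for an entity on the root cfs_rq */
};

/* Number of entities visited when walking from se up to the root level. */
static int toy_levels_walked(const struct toy_se *se)
{
	int n = 0;

	for (; se; se = se->parent)
		n++;
	return n;
}

/* Difference of two vruntimes in nanoseconds. */
static int64_t toy_vruntime_delta_ns(int64_t a, int64_t b)
{
	return a - b;
}
```

A task nested four groups deep gives five visited entities, matching one named and four nameless sched_place lines in the excerpt. The delta between the kworker's vruntime and the last nameless entry is 1483825 ns.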

2023-11-19 13:30:35

by Bagas Sanjaya

Subject: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

On Thu, Nov 16, 2023 at 07:58:18PM +0100, Tobias Huschle wrote:
> Hi,
>
> when testing the EEVDF scheduler we stumbled upon a performance regression
> in a uperf scenario and would like to
> kindly ask for feedback on whether we are going into the right direction
> with our analysis so far.
>
> The base scenario are two KVM guests running on an s390 LPAR. One guest
> hosts the uperf server, one the uperf client.
> With EEVDF we observe a regression of ~50% for a strburst test.
> For a more detailed description of the setup see the section TEST SUMMARY at
> the bottom.
>
> Bisecting led us to the following commit which appears to introduce the
> regression:
> 86bfbb7ce4f6 sched/fair: Add lag based placement
>
> We then compared the last good commit we identified with a recent level of
> the devel branch.
> The issue still persists on 6.7 rc1 although there is some improvement (down
> from 62% regression to 49%)
>
> All analysis described further are based on a 6.6 rc7 kernel.
>
> We sampled perf data to get an idea on what is going wrong and ended up
> seeing an dramatic increase in the maximum
> wait times from 3ms up to 366ms. See section WAIT DELAYS below for more
> details.
>
> We then collected tracing data to get a better insight into what is going
> on.
> The trace excerpt in section TRACE EXCERPT shows one example (of multiple
> per test run) of the problematic scenario where
> a kworker(pid=6525) has to wait for 39,718 ms.
>
> Short summary:
> The mentioned kworker has been scheduled to CPU 14 before the tracing was
> enabled.
> A vhost process is migrated onto CPU 14.
> The vruntimes of kworker and vhost differ significantly (86642125805 vs
> 4242563284 -> factor 20)
> The vhost process wants to wake up the kworker, therefore the kworker is
> placed onto the runqueue again and set to runnable.
> The vhost process continues to execute, waking up other vhost processes on
> other CPUs.
>
> So far this behavior is not different to what we see on pre-EEVDF kernels.
>
> On timestamp 576.162767, the vhost process triggers the last wake up of
> another vhost on another CPU.
> Until timestamp 576.171155, we see no other activity. Now, the vhost process
> ends its time slice.
> Then, vhost gets re-assigned new time slices 4 times and gets then migrated
> off to CPU 15.
> This does not occur with older kernels.
> The kworker has to wait for the migration to happen in order to be able to
> execute again.
> This is due to the fact, that the vruntime of the kworker is significantly
> larger than the one of vhost.
>
>
> We observed the large difference in vruntime between kworker and vhost in
> the same magnitude on
> a kernel built based on the parent of the commit mentioned above.
> With EEVDF, the kworker is doomed to wait until the vhost either catches up
> on vruntime (which would take 86 seconds)
> or the vhost is migrated off of the CPU.
>
> We found some options which sound plausible but we are not sure if they are
> valid or not:
>
> 1. The wake up path has a dependency on the vruntime metrics that now delays
> the execution of the kworker.
> 2. The previous commit af4cf40470c2 (sched/fair: Add cfs_rq::avg_vruntime)
> which updates the way cfs_rq->min_vruntime and
> cfs_rq->avg_runtime are set might have introduced an issue which is
> uncovered with the commit mentioned above.
> 3. An assumption in the vhost code which causes vhost to rely on being
> scheduled off in time to allow the kworker to proceed.
>
> We also stumbled upon the following mailing thread:
> https://lore.kernel.org/lkml/ZORaUsd+So+tnyMV@chenyu5-mobl2/
> That conversation, and the patches derived from it lead to the assumption
> that the wake up path might be adjustable in a way
> that this case in particular can be addressed.
> At the same time, the vast difference in vruntimes is concerning since, at
> least for some time frame, both processes are on the runqueue.
>
> We would be glad to hear some feedback on which paths to pursue and which
> might just be a dead end in the first place.
>
>
> #################### TRACE EXCERPT ####################
> The sched_place trace event was added to the end of the place_entity
> function and outputs:
> sev -> sched_entity vruntime
> sed -> sched_entity deadline
> sel -> sched_entity vlag
> avg -> cfs_rq avg_vruntime
> min -> cfs_rq min_vruntime
> cpu -> cpu of cfs_rq
> nr -> cfs_rq nr_running
> ---
> CPU 3/KVM-2950 [014] d.... 576.161432: sched_migrate_task:
> comm=vhost-2920 pid=2941 prio=120 orig_cpu=15 dest_cpu=14
> --> migrates task from cpu 15 to 14
> CPU 3/KVM-2950 [014] d.... 576.161433: sched_place: comm=vhost-2920
> pid=2941 sev=4242563284 sed=4245563284 sel=0 avg=4242563284 min=4242563284
> cpu=14 nr=0
> --> places vhost 2920 on CPU 14 with vruntime 4242563284
> CPU 3/KVM-2950 [014] d.... 576.161433: sched_place: comm= pid=0
> sev=16329848593 sed=16334604010 sel=0 avg=16329848593 min=16329848593 cpu=14
> nr=0
> CPU 3/KVM-2950 [014] d.... 576.161433: sched_place: comm= pid=0
> sev=42560661157 sed=42627443765 sel=0 avg=42560661157 min=42560661157 cpu=14
> nr=0
> CPU 3/KVM-2950 [014] d.... 576.161434: sched_place: comm= pid=0
> sev=53846627372 sed=54125900099 sel=0 avg=53846627372 min=53846627372 cpu=14
> nr=0
> CPU 3/KVM-2950 [014] d.... 576.161434: sched_place: comm= pid=0
> sev=86640641980 sed=87255041979 sel=0 avg=86640641980 min=86640641980 cpu=14
> nr=0
> CPU 3/KVM-2950 [014] dN... 576.161434: sched_stat_wait:
> comm=vhost-2920 pid=2941 delay=9958 [ns]
> CPU 3/KVM-2950 [014] d.... 576.161435: sched_switch: prev_comm=CPU
> 3/KVM prev_pid=2950 prev_prio=120 prev_state=S ==> next_comm=vhost-2920
> next_pid=2941 next_prio=120
> vhost-2920-2941 [014] D.... 576.161439: sched_waking:
> comm=vhost-2286 pid=2309 prio=120 target_cpu=008
> vhost-2920-2941 [014] d.... 576.161446: sched_waking:
> comm=kworker/14:0 pid=6525 prio=120 target_cpu=014
> vhost-2920-2941 [014] d.... 576.161447: sched_place:
> comm=kworker/14:0 pid=6525 sev=86642125805 sed=86645125805 sel=0
> avg=86642125805 min=86642125805 cpu=14 nr=1
> --> places kworker 6525 on cpu 14 with vruntime 86642125805
> --> which is far larger than vhost vruntime of 4242563284
> vhost-2920-2941 [014] d.... 576.161447: sched_stat_blocked:
> comm=kworker/14:0 pid=6525 delay=10143757 [ns]
> vhost-2920-2941 [014] dN... 576.161447: sched_wakeup:
> comm=kworker/14:0 pid=6525 prio=120 target_cpu=014
> vhost-2920-2941 [014] dN... 576.161448: sched_stat_runtime:
> comm=vhost-2920 pid=2941 runtime=13884 [ns] vruntime=4242577168 [ns]
> --> vhost 2920 finishes after 13884 ns of runtime
> vhost-2920-2941 [014] dN... 576.161448: sched_stat_wait:
> comm=kworker/14:0 pid=6525 delay=0 [ns]
> vhost-2920-2941 [014] d.... 576.161448: sched_switch:
> prev_comm=vhost-2920 prev_pid=2941 prev_prio=120 prev_state=R+ ==>
> next_comm=kworker/14:0 next_pid=6525 next_prio=120
> --> switch to kworker
> kworker/14:0-6525 [014] d.... 576.161449: sched_waking: comm=CPU 2/KVM
> pid=2949 prio=120 target_cpu=007
> kworker/14:0-6525 [014] d.... 576.161450: sched_stat_runtime:
> comm=kworker/14:0 pid=6525 runtime=3714 [ns] vruntime=86642129519 [ns]
> --> kworker finishes after 3714 ns of runtime
> kworker/14:0-6525 [014] d.... 576.161450: sched_stat_wait:
> comm=vhost-2920 pid=2941 delay=3714 [ns]
> kworker/14:0-6525 [014] d.... 576.161451: sched_switch:
> prev_comm=kworker/14:0 prev_pid=6525 prev_prio=120 prev_state=I ==>
> next_comm=vhost-2920 next_pid=2941 next_prio=120
> --> switch back to vhost
> vhost-2920-2941 [014] d.... 576.161478: sched_waking:
> comm=kworker/14:0 pid=6525 prio=120 target_cpu=014
> vhost-2920-2941 [014] d.... 576.161478: sched_place:
> comm=kworker/14:0 pid=6525 sev=86642191859 sed=86645191859 sel=-1150
> avg=86642188144 min=86642188144 cpu=14 nr=1
> --> kworker placed again on cpu 14 with vruntime 86642191859, the problem
> occurs only if lag <= 0, having lag=0 does not always hit the problem though
> vhost-2920-2941 [014] d.... 576.161478: sched_stat_blocked:
> comm=kworker/14:0 pid=6525 delay=27943 [ns]
> vhost-2920-2941 [014] d.... 576.161479: sched_wakeup:
> comm=kworker/14:0 pid=6525 prio=120 target_cpu=014
> vhost-2920-2941 [014] D.... 576.161511: sched_waking:
> comm=vhost-2286 pid=2308 prio=120 target_cpu=006
> vhost-2920-2941 [014] D.... 576.161512: sched_waking:
> comm=vhost-2286 pid=2309 prio=120 target_cpu=008
> vhost-2920-2941 [014] D.... 576.161516: sched_waking:
> comm=vhost-2286 pid=2308 prio=120 target_cpu=006
> vhost-2920-2941 [014] D.... 576.161773: sched_waking:
> comm=vhost-2286 pid=2308 prio=120 target_cpu=006
> vhost-2920-2941 [014] D.... 576.161775: sched_waking:
> comm=vhost-2286 pid=2309 prio=120 target_cpu=008
> vhost-2920-2941 [014] D.... 576.162103: sched_waking:
> comm=vhost-2286 pid=2308 prio=120 target_cpu=006
> vhost-2920-2941 [014] D.... 576.162105: sched_waking:
> comm=vhost-2286 pid=2307 prio=120 target_cpu=021
> vhost-2920-2941 [014] D.... 576.162326: sched_waking:
> comm=vhost-2286 pid=2305 prio=120 target_cpu=004
> vhost-2920-2941 [014] D.... 576.162437: sched_waking:
> comm=vhost-2286 pid=2308 prio=120 target_cpu=006
> vhost-2920-2941 [014] D.... 576.162767: sched_waking:
> comm=vhost-2286 pid=2305 prio=120 target_cpu=004
> vhost-2920-2941 [014] d.h.. 576.171155: sched_stat_runtime:
> comm=vhost-2920 pid=2941 runtime=9704465 [ns] vruntime=4252281633 [ns]
> vhost-2920-2941 [014] d.h.. 576.181155: sched_stat_runtime:
> comm=vhost-2920 pid=2941 runtime=10000377 [ns] vruntime=4262282010 [ns]
> vhost-2920-2941 [014] d.h.. 576.191154: sched_stat_runtime:
> comm=vhost-2920 pid=2941 runtime=9999514 [ns] vruntime=4272281524 [ns]
> vhost-2920-2941 [014] d.h.. 576.201155: sched_stat_runtime:
> comm=vhost-2920 pid=2941 runtime=10000246 [ns] vruntime=4282281770 [ns]
> --> vhost gets rescheduled multiple times because its vruntime is
> significantly smaller than the vruntime of the kworker
> vhost-2920-2941 [014] dNh.. 576.201176: sched_wakeup:
> comm=migration/14 pid=85 prio=0 target_cpu=014
> vhost-2920-2941 [014] dN... 576.201191: sched_stat_runtime:
> comm=vhost-2920 pid=2941 runtime=25190 [ns] vruntime=4282306960 [ns]
> vhost-2920-2941 [014] d.... 576.201192: sched_switch:
> prev_comm=vhost-2920 prev_pid=2941 prev_prio=120 prev_state=R+ ==>
> next_comm=migration/14 next_pid=85 next_prio=0
> migration/14-85 [014] d..1. 576.201194: sched_migrate_task:
> comm=vhost-2920 pid=2941 prio=120 orig_cpu=14 dest_cpu=15
> --> vhost gets migrated off of cpu 14
> migration/14-85 [014] d..1. 576.201194: sched_place: comm=vhost-2920
> pid=2941 sev=3198666923 sed=3201666923 sel=0 avg=3198666923 min=3198666923
> cpu=15 nr=0
> migration/14-85 [014] d..1. 576.201195: sched_place: comm= pid=0
> sev=12775683594 sed=12779398224 sel=0 avg=12775683594 min=12775683594 cpu=15
> nr=0
> migration/14-85 [014] d..1. 576.201195: sched_place: comm= pid=0
> sev=33655559178 sed=33661025369 sel=0 avg=33655559178 min=33655559178 cpu=15
> nr=0
> migration/14-85 [014] d..1. 576.201195: sched_place: comm= pid=0
> sev=42240572785 sed=42244083642 sel=0 avg=42240572785 min=42240572785 cpu=15
> nr=0
> migration/14-85 [014] d..1. 576.201196: sched_place: comm= pid=0
> sev=70190876523 sed=70194789898 sel=-13068763 avg=70190876523
> min=70190876523 cpu=15 nr=0
> migration/14-85 [014] d.... 576.201198: sched_stat_wait:
> comm=kworker/14:0 pid=6525 delay=39718472 [ns]
> migration/14-85 [014] d.... 576.201198: sched_switch:
> prev_comm=migration/14 prev_pid=85 prev_prio=0 prev_state=S ==>
> next_comm=kworker/14:0 next_pid=6525 next_prio=120
> --> only now, kworker is eligible to run again, after a delay of 39718472
> ns
> kworker/14:0-6525 [014] d.... 576.201200: sched_waking: comm=CPU 0/KVM
> pid=2947 prio=120 target_cpu=012
> kworker/14:0-6525 [014] d.... 576.201290: sched_stat_runtime:
> comm=kworker/14:0 pid=6525 runtime=92941 [ns] vruntime=86642284800 [ns]
>
> #################### WAIT DELAYS - PERF LATENCY ####################
> last good commit --> perf sched latency -s max
> -------------------------------------------------------------------------------------------------------------------------------------------
> Task | Runtime ms | Switches | Avg delay ms | Max delay ms | Max delay start | Max delay end |
> -------------------------------------------------------------------------------------------------------------------------------------------
> CPU 2/KVM:(2) | 5399.650 ms | 108698 | avg: 0.003 ms | max: 3.077 ms | max start: 544.090322 s | max end: 544.093399 s
> CPU 7/KVM:(2) | 5111.132 ms | 69632 | avg: 0.003 ms | max: 2.980 ms | max start: 544.690994 s | max end: 544.693974 s
> kworker/22:3-ev:723 | 342.944 ms | 63417 | avg: 0.005 ms | max: 1.880 ms | max start: 545.235430 s | max end: 545.237310 s
> CPU 0/KVM:(2) | 8171.431 ms | 433099 | avg: 0.003 ms | max: 1.004 ms | max start: 547.970344 s | max end: 547.971348 s
> CPU 1/KVM:(2) | 5486.260 ms | 258702 | avg: 0.003 ms | max: 1.002 ms | max start: 548.782514 s | max end: 548.783516 s
> CPU 5/KVM:(2) | 4766.143 ms | 65727 | avg: 0.003 ms | max: 0.997 ms | max start: 545.313610 s | max end: 545.314607 s
> vhost-2268:(6) | 13206.503 ms | 315030 | avg: 0.003 ms | max: 0.989 ms | max start: 550.887761 s | max end: 550.888749 s
> vhost-2892:(6) | 14467.268 ms | 214005 | avg: 0.003 ms | max: 0.981 ms | max start: 545.213819 s | max end: 545.214800 s
> CPU 3/KVM:(2) | 5538.908 ms | 85105 | avg: 0.003 ms | max: 0.883 ms | max start: 547.138139 s | max end: 547.139023 s
> CPU 6/KVM:(2) | 5289.827 ms | 72301 | avg: 0.003 ms | max: 0.836 ms | max start: 551.094590 s | max end: 551.095425 s
>
>
> 6.6 rc7 --> perf sched latency -s max
> -------------------------------------------------------------------------------------------------------------------------------------------
> Task | Runtime ms | Switches | Avg delay ms | Max delay ms | Max delay start | Max delay end |
> -------------------------------------------------------------------------------------------------------------------------------------------
> kworker/19:2-ev:1071 | 69.482 ms | 12700 | avg: 0.050 ms | max: 366.314 ms | max start: 54705.674294 s | max end: 54706.040607 s
> kworker/13:1-ev:184 | 78.048 ms | 14645 | avg: 0.067 ms | max: 287.738 ms | max start: 54710.312863 s | max end: 54710.600602 s
> kworker/12:1-ev:46148 | 138.488 ms | 26660 | avg: 0.021 ms | max: 147.414 ms | max start: 54706.133161 s | max end: 54706.280576 s
> kworker/16:2-ev:33076 | 149.175 ms | 29491 | avg: 0.026 ms | max: 139.752 ms | max start: 54708.410845 s | max end: 54708.550597 s
> CPU 3/KVM:(2) | 1934.714 ms | 41896 | avg: 0.007 ms | max: 92.126 ms | max start: 54713.158498 s | max end: 54713.250624 s
> kworker/7:2-eve:17001 | 68.164 ms | 11820 | avg: 0.045 ms | max: 69.717 ms | max start: 54707.100903 s | max end: 54707.170619 s
> kworker/17:1-ev:46510 | 68.804 ms | 13328 | avg: 0.037 ms | max: 67.894 ms | max start: 54711.022711 s | max end: 54711.090605 s
> kworker/21:1-ev:45782 | 68.906 ms | 13215 | avg: 0.021 ms | max: 59.473 ms | max start: 54709.351135 s | max end: 54709.410608 s
> ksoftirqd/17:101 | 0.041 ms | 2 | avg: 25.028 ms | max: 50.047 ms | max start: 54711.040578 s | max end: 54711.090625 s
>
> #################### TEST SUMMARY ####################
> Setup description:
> - single KVM host with 2 identical guests
> - guests are connected virtually via Open vSwitch
> - guests run uperf streaming read workload with 50 parallel connections
> - one guest acts as uperf client, the other one as uperf server
>
> Regression:
> kernel-6.5.0-rc2: 78 Gb/s (before 86bfbb7ce4f6 sched/fair: Add lag based
> placement)
> kernel-6.5.0-rc2: 29 Gb/s (with 86bfbb7ce4f6 sched/fair: Add lag based
> placement)
> kernel-6.7.0-rc1: 41 Gb/s
>
> KVM host:
> - 12 dedicated IFLs, SMT-2 (24 Linux CPUs)
> - 64 GiB memory
> - FEDORA 38
> - kernel commandline: transparent_hugepage=never audit_enable=0 audit=0
> audit_debug=0 selinux=0
>
> KVM guests:
> - 8 vCPUs
> - 8 GiB memory
> - RHEL 9.2
> - kernel: 5.14.0-162.6.1.el9_1.s390x
> - kernel commandline: transparent_hugepage=never audit_enable=0 audit=0
> audit_debug=0 selinux=0
>
> Open vSwitch:
> - Open vSwitch with 2 ports, each with mtu=32768 and qlen=15000
> - Open vSwitch ports attached to guests via virtio-net
> - each guest has 4 vhost-queues
>
> Domain xml snippet for Open vSwitch port:
> <interface type="bridge" dev="OVS">
> <source bridge="vswitch0"/>
> <mac address="02:bb:97:28:02:02"/>
> <virtualport type="openvswitch"/>
> <model type="virtio"/>
> <target dev="vport1"/>
> <driver name="vhost" queues="4"/>
> <address type="ccw" cssid="0xfe" ssid="0x0" devno="0x0002"/>
> </interface>
>
> Benchmark: uperf
> - workload: str-readx30k, 50 active parallel connections
> - uperf server permanently sends data in 30720-byte chunks
> - uperf client receives and acknowledges this data
> - Server: uperf -s
> - Client: uperf -a -i 30 -m uperf.xml
>
> uperf.xml:
> <?xml version="1.0"?>
> <profile name="strburst">
> <group nprocs="50">
> <transaction iterations="1">
> <flowop type="connect" options="remotehost=10.161.28.3 protocol=tcp
> "/>
> </transaction>
> <transaction duration="300">
> <flowop type="read" options="count=640 size=30k"/>
> </transaction>
> <transaction iterations="1">
> <flowop type="disconnect" />
> </transaction>
> </group>
> </profile>

Thanks for the regression report. I'm adding it to regzbot:

#regzbot ^introduced: 86bfbb7ce4f67a

--
An old man doll... just what I always wanted! - Clara

2023-11-20 10:57:33

by Peter Zijlstra

Subject: Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

On Sat, Nov 18, 2023 at 01:14:32PM +0800, Abel Wu wrote:

> Hi Peter, I'm a little confused here. As we adopt placement strategy #1
> when PLACE_LAG is enabled, the lag of that entity needs to be preserved.
> Given that the weight doesn't change, we have:
>
> vl' = vl
>
> But in fact it is scaled on placement:
>
> vl' = vl * W/(W + w)

(W+w)/W

>
> Is this intended?

The scaling, yes that's intended and the comment explains why. So now
you have me confused too :-)

Specifically, I want the lag after placement to be equal to the lag we
come in with. Since placement will affect avg_vruntime (adding one
element to the average changes the average etc..) the placement also
affects the lag as measured after placement.

Or rather, if you enqueue and dequeue, I want the lag to be preserved.
If you do not take placement into consideration, lag will dissipate real
quick.

> And to illustrate my understanding of strategy #1:

> @@ -5162,41 +5165,17 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> * vl_i is given by:
> *
> * V' = (\Sum w_j*v_j + w_i*v_i) / (W + w_i)
> - * = (W*V + w_i*(V - vl_i)) / (W + w_i)
> - * = (W*V + w_i*V - w_i*vl_i) / (W + w_i)
> - * = (V*(W + w_i) - w_i*l) / (W + w_i)
> - * = V - w_i*vl_i / (W + w_i)
> - *
> - * And the actual lag after adding an entity with vl_i is:
> - *
> - * vl'_i = V' - v_i
> - * = V - w_i*vl_i / (W + w_i) - (V - vl_i)
> - * = vl_i - w_i*vl_i / (W + w_i)
> - *
> - * Which is strictly less than vl_i. So in order to preserve lag
> - * we should inflate the lag before placement such that the
> - * effective lag after placement comes out right.
> - *
> - * As such, invert the above relation for vl'_i to get the vl_i
> - * we need to use such that the lag after placement is the lag
> - * we computed before dequeue.
> + * = (W*V + w_i*(V' - vl_i)) / (W + w_i)
> + * = V - w_i*vl_i / W
> *
> - * vl'_i = vl_i - w_i*vl_i / (W + w_i)
> - * = ((W + w_i)*vl_i - w_i*vl_i) / (W + w_i)
> - *
> - * (W + w_i)*vl'_i = (W + w_i)*vl_i - w_i*vl_i
> - * = W*vl_i
> - *
> - * vl_i = (W + w_i)*vl'_i / W
> */
> load = cfs_rq->avg_load;
> if (curr && curr->on_rq)
> load += scale_load_down(curr->load.weight);
> -
> - lag *= load + scale_load_down(se->load.weight);
> if (WARN_ON_ONCE(!load))
> load = 1;
> - lag = div_s64(lag, load);
> +
> + vruntime -= div_s64(lag * scale_load_down(se->load.weight), load);
> }
> se->vruntime = vruntime - lag;

So you're proposing we do:

v = V - (lag * w) / (W + w) - lag

?

That can be written like:

v = V - (lag * w) / (W+w) - (lag * (W+w)) / (W+w)
= V - (lag * (W+w) + lag * w) / (W+w)
= V - (lag * (W+2w)) / (W+w)

And that turns into a mess AFAICT.

Let me repeat my earlier argument. Suppose v,w,l are the new element.
V,W are the old avg_vruntime and sum-weight.

Then: V = V*W / W, and by extension: V' = (V*W + v*w) / (W + w).

The new lag, after placement:

l' = V' - v = (V*W + v*w) / (W+w) - v
            = (V*W + v*w) / (W+w) - v * (W+w) / (W+w)
            = (V*W + v*w - v*W - v*w) / (W+w)
            = (V*W - v*W) / (W+w)
            = W*(V-v) / (W+w)
            = W/(W+w) * (V-v)

Substitute: v = V - (W+w)/W * l, my scaling thing, to obtain:

l' = W/(W+w) * (V - (V - (W+w)/W * l))
= W/(W+w) * (V - V + (W+w)/W * l)
= W/(W+w) * (W+w)/W * l
= l

So by scaling, we've preserved lag across placement.
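
The result above can be checked numerically. What follows is a minimal user-space sketch, not kernel code: the helper names (place_scaled, lag_after, check_lag_preserved) are made up for illustration, and the sample values are chosen so that every integer division is exact.

```c
#include <assert.h>
#include <stdint.h>

/*
 * User-space illustration only; all helper names are hypothetical.
 * V, W: weighted average vruntime and sum of weights before placement.
 * w, l: weight of the new entity and the lag it should end up with.
 */

/* v = V - (W+w)/W * l, i.e. placement with the inflated lag */
static int64_t place_scaled(int64_t V, int64_t W, int64_t w, int64_t l)
{
	return V - (l * (W + w)) / W;
}

/* l' = V' - v, where V' = (V*W + v*w) / (W+w) */
static int64_t lag_after(int64_t V, int64_t W, int64_t w, int64_t v)
{
	int64_t V_new = (V * W + v * w) / (W + w);

	return V_new - v;
}

static void check_lag_preserved(void)
{
	int64_t V = 10000, W = 3072, w = 1024, l = -300;

	/* Scaled placement: the lag measured afterwards equals l. */
	assert(lag_after(V, W, w, place_scaled(V, W, w, l)) == l);

	/* Unscaled placement v = V - l: the lag shrinks to W/(W+w) * l. */
	assert(lag_after(V, W, w, V - l) == l * W / (W + w));
}
```

With these numbers the unscaled placement leaves a lag of -225 instead of -300, which is the dissipation the scaling is meant to avoid.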

That make sense?

2023-11-20 12:06:50

by Abel Wu

Subject: Re: Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

On 11/20/23 6:56 PM, Peter Zijlstra Wrote:
> On Sat, Nov 18, 2023 at 01:14:32PM +0800, Abel Wu wrote:
>
>> Hi Peter, I'm a little confused here. As we adopt placement strategy #1
>> when PLACE_LAG is enabled, the lag of that entity needs to be preserved.
>> Given that the weight doesn't change, we have:
>>
>> vl' = vl
>>
>> But in fact it is scaled on placement:
>>
>> vl' = vl * W/(W + w)
>
> (W+w)/W

Ah, right. I misunderstood (again) the comment which says:

vl_i = (W + w_i)*vl'_i / W

So the current implementation is:

v' = V - vl'

and what I was proposing is:

v' = V' - vl

and they are equal in fact.

>
>>
>> Is this intended?
>
> The scaling, yes that's intended and the comment explains why. So now
> you have me confused too :-)
>
> Specifically, I want the lag after placement to be equal to the lag we
> come in with. Since placement will affect avg_vruntime (adding one
> element to the average changes the average etc..) the placement also
> affects the lag as measured after placement.

Yes. You did the math in an iterative fashion, while mine works from the
final state:

v' = V' - vlag
V' = (WV + wv') / (W + w)

which gives:

V' = V - w * vlag / W
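
Spelled out, with l standing for vlag, the substitution goes as follows. This is only a worked restatement of the two equations above; no new quantities are introduced.

```latex
\begin{align*}
v' &= V' - l, \qquad V' = \frac{W V + w v'}{W + w} \\
(W + w)\,V' &= W V + w\,(V' - l) \\
W\,V' &= W V - w\,l \\
V' &= V - \frac{w\,l}{W}
\end{align*}
```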

>
> Or rather, if you enqueue and dequeue, I want the lag to be preserved.
> If you do not take placement into consideration, lag will dissipate real
> quick.
>
>> And to illustrate my understanding of strategy #1:
>
>> @@ -5162,41 +5165,17 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>> * vl_i is given by:
>> *
>> * V' = (\Sum w_j*v_j + w_i*v_i) / (W + w_i)
>> - * = (W*V + w_i*(V - vl_i)) / (W + w_i)
>> - * = (W*V + w_i*V - w_i*vl_i) / (W + w_i)
>> - * = (V*(W + w_i) - w_i*l) / (W + w_i)
>> - * = V - w_i*vl_i / (W + w_i)
>> - *
>> - * And the actual lag after adding an entity with vl_i is:
>> - *
>> - * vl'_i = V' - v_i
>> - * = V - w_i*vl_i / (W + w_i) - (V - vl_i)
>> - * = vl_i - w_i*vl_i / (W + w_i)
>> - *
>> - * Which is strictly less than vl_i. So in order to preserve lag
>> - * we should inflate the lag before placement such that the
>> - * effective lag after placement comes out right.
>> - *
>> - * As such, invert the above relation for vl'_i to get the vl_i
>> - * we need to use such that the lag after placement is the lag
>> - * we computed before dequeue.
>> + * = (W*V + w_i*(V' - vl_i)) / (W + w_i)
>> + * = V - w_i*vl_i / W
>> *
>> - * vl'_i = vl_i - w_i*vl_i / (W + w_i)
>> - * = ((W + w_i)*vl_i - w_i*vl_i) / (W + w_i)
>> - *
>> - * (W + w_i)*vl'_i = (W + w_i)*vl_i - w_i*vl_i
>> - * = W*vl_i
>> - *
>> - * vl_i = (W + w_i)*vl'_i / W
>> */
>> load = cfs_rq->avg_load;
>> if (curr && curr->on_rq)
>> load += scale_load_down(curr->load.weight);
>> -
>> - lag *= load + scale_load_down(se->load.weight);
>> if (WARN_ON_ONCE(!load))
>> load = 1;
>> - lag = div_s64(lag, load);
>> +
>> + vruntime -= div_s64(lag * scale_load_down(se->load.weight), load);
>> }
>> se->vruntime = vruntime - lag;
>
>
> So you're proposing we do:
>
> v = V - (lag * w) / (W + w) - lag

What I'm proposing is:

V' = V - w * vlag / W

so we have:

v' = V' - vlag
= V - vlag * w/W - vlag
= V - vlag * (W + w)/W

which is exactly the same as current implementation.

>
> ?
>
> That can be written like:
>
> v = V - (lag * w) / (W+w) - (lag * (W+w)) / (W+w)
> = V - (lag * (W+w) + lag * w) / (W+w)
> = V - (lag * (W+2w)) / (W+w)
>
> And that turns into a mess AFAICT.
>
>
> Let me repeat my earlier argument. Suppose v,w,l are the new element.
> V,W are the old avg_vruntime and sum-weight.
>
> Then: V = V*W / W, and by extension: V' = (V*W + v*w) / (W + w).
>
> The new lag, after placement:
>
> l' = V' - v = (V*W + v*w) / (W+w) - v
>             = (V*W + v*w) / (W+w) - v * (W+w) / (W+w)
>             = (V*W + v*w - v*W - v*w) / (W+w)
>             = (V*W - v*W) / (W+w)
>             = W*(V-v) / (W+w)
>             = W/(W+w) * (V-v)
>
> Substitute: v = V - (W+w)/W * l, my scaling thing, to obtain:
>
> l' = W/(W+w) * (V - (V - (W+w)/W * l))
> = W/(W+w) * (V - V + (W+w)/W * l)
> = W/(W+w) * (W+w)/W * l
> = l
>
> So by scaling, we've preserved lag across placement.
>
> That make sense?

Yes, I think I won't misunderstand again for the 3rd time :)

Thanks!
Abel

2023-11-22 10:02:47

by Peter Zijlstra

Subject: Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

On Tue, Nov 21, 2023 at 02:17:21PM +0100, Tobias Huschle wrote:

> We applied both suggested patch options and ran the test again, so
>
> sched/eevdf: Fix vruntime adjustment on reweight
> sched/fair: Update min_vruntime for reweight_entity() correctly
>
> and
>
> sched/eevdf: Delay dequeue
>
> Unfortunately, both variants do NOT fix the problem.
> The regression remains unchanged.

Thanks for testing.

> I will continue getting myself familiar with how cgroups are scheduled to dig
> deeper here. If there are any other ideas, I'd be happy to use them as a
> starting point for further analysis.
>
> Would additional traces still be of interest? If so, I would be glad to
> provide them.

So, since it got bisected to the placement logic, but is a cgroup
related issue, I was thinking that 'Delay dequeue' might not cut it,
that only works for tasks, not the internal entities.

The below should also work for internal entities, but last time I poked
around with it I had some regressions elsewhere -- you know, fix one,
wreck another type of situations on hand.

But still, could you please give it a go -- it applies cleanly to linus'
master and -rc2.

---
Subject: sched/eevdf: Revenge of the Sith^WSleepers

For tasks that have received excess service (negative lag) allow them to
gain parity (zero lag) by sleeping.

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
---
kernel/sched/fair.c | 36 ++++++++++++++++++++++++++++++++++++
kernel/sched/features.h | 6 ++++++
2 files changed, 42 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d7a3c63a2171..b975e4b07a68 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5110,6 +5110,33 @@ static inline void update_misfit_status(struct task_struct *p, struct rq *rq) {}

#endif /* CONFIG_SMP */

+static inline u64
+entity_vlag_sleeper(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
+{
+	u64 now, vdelta;
+	s64 delta;
+
+	if (!(flags & ENQUEUE_WAKEUP))
+		return se->vlag;
+
+	if (flags & ENQUEUE_MIGRATED)
+		return 0;
+
+	now = rq_clock_task(rq_of(cfs_rq));
+	delta = now - se->exec_start;
+	if (delta < 0)
+		return se->vlag;
+
+	if (sched_feat(GENTLE_SLEEPER))
+		delta /= 2;
+
+	vdelta = __calc_delta(delta, NICE_0_LOAD, &cfs_rq->load);
+	if (vdelta < -se->vlag)
+		return se->vlag + vdelta;
+
+	return 0;
+}
+
static void
place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{
@@ -5133,6 +5160,15 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)

 		lag = se->vlag;
 
+		/*
+		 * Allow tasks that have received too much service (negative
+		 * lag) to (re)gain parity (zero lag) by sleeping for the
+		 * equivalent duration. This ensures they will be readily
+		 * eligible.
+		 */
+		if (sched_feat(PLACE_SLEEPER) && lag < 0)
+			lag = entity_vlag_sleeper(cfs_rq, se, flags);
+
 		/*
 		 * If we want to place a task and preserve lag, we have to
 		 * consider the effect of the new entity on the weighted
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index a3ddf84de430..722282d3ed07 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -7,6 +7,12 @@
SCHED_FEAT(PLACE_LAG, true)
SCHED_FEAT(PLACE_DEADLINE_INITIAL, true)
SCHED_FEAT(RUN_TO_PARITY, true)
+/*
+ * Let sleepers earn back lag, but not more than 0-lag. GENTLE_SLEEPERS earn at
+ * half the speed.
+ */
+SCHED_FEAT(PLACE_SLEEPER, true)
+SCHED_FEAT(GENTLE_SLEEPER, true)

/*
* Prefer to schedule the task we woke last (assuming it failed
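
To illustrate what the patch above does on the wakeup path, here is a simplified user-space model. It is a sketch under stated assumptions, not the kernel implementation: the names model_sleeper_lag and MODEL_NICE_0_LOAD are hypothetical, the ENQUEUE_* flag handling is left out, and __calc_delta() is replaced by a plain multiply/divide against an assumed runqueue weight.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_NICE_0_LOAD 1024	/* assumed weight of a nice-0 task */

/*
 * Hypothetical model of the wakeup case in entity_vlag_sleeper().
 *
 * vlag:      lag stored when the entity went to sleep
 * slept_ns:  time since the entity last ran
 * rq_weight: total weight on the runqueue (assumed > 0)
 * gentle:    models GENTLE_SLEEPER (earn lag back at half speed)
 */
static int64_t model_sleeper_lag(int64_t vlag, int64_t slept_ns,
				 int64_t rq_weight, bool gentle)
{
	int64_t vdelta;

	if (vlag >= 0 || slept_ns < 0)
		return vlag;		/* nothing to earn back */

	if (gentle)
		slept_ns /= 2;

	/* convert the sleep time into virtual time */
	vdelta = slept_ns * MODEL_NICE_0_LOAD / rq_weight;

	if (vdelta < -vlag)
		return vlag + vdelta;	/* still owes some service */

	return 0;			/* capped at zero lag */
}

static void demo_sleeper_lag(void)
{
	/* 600 ns of sleep pays back 600 units of a -1000 lag... */
	assert(model_sleeper_lag(-1000, 600, 1024, false) == -400);
	/* ...but only 300 units with the gentle variant. */
	assert(model_sleeper_lag(-1000, 600, 1024, true) == -700);
	/* A long enough sleep reaches parity, never positive lag. */
	assert(model_sleeper_lag(-1000, 5000, 1024, true) == 0);
}
```

The cap at zero is the point of the design: sleeping can erase a debt of service but cannot build up credit.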

2023-11-28 08:56:30

by Abel Wu

Subject: Re: Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

On 11/27/23 9:56 PM, Tobias Huschle Wrote:
> On Wed, Nov 22, 2023 at 11:00:16AM +0100, Peter Zijlstra wrote:
>> On Tue, Nov 21, 2023 at 02:17:21PM +0100, Tobias Huschle wrote:
>>
>> The below should also work for internal entities, but last time I poked
>> around with it I had some regressions elsewhere -- you know, fix one,
>> wreck another type of situations on hand.
>>
>> But still, could you please give it a go -- it applies cleanly to linus'
>> master and -rc2.
>>
>> ---
>> Subject: sched/eevdf: Revenge of the Sith^WSleepers
>>
>
> Tried the patch, it does not help unfortunately.
>
> It might also be possible that the long-running vhost is stuck on something
> during those phases where it just runs for a while. This might have been a
> risk for some time, and EEVDF might have just uncovered an unfortunate
> sequence of events.
> I'll look into this option.
>
> I also added some more trace outputs in order to find the actual vruntimes
> of the cgroup parents. The numbers look kind of reasonable, but I struggle
> to judge this with certainty.
>
> In one of the scenarios where the kworker sees an absurd wait time, the
> following occurs (full trace below):
>
> - The kworker ends its timeslice after 4941 ns
> - __pick_eevdf finds the cgroup holding vhost as the next option to execute
> - Last known values are:
> vruntime deadline
> cgroup 56117619190 57650477291 -> depth 0
> kworker 56117624131 56120619190
> This is fair, since the kworker is not runnable here.
> - At depth 4, the cgroup shows the observed vruntime value which is smaller
> by a factor of 20, but depth 0 seems to be running with values of the
> correct magnitude.

A running child implies that its parent is also cfs->curr, but not
vice versa if there is more than one child.

> - cgroup at depth 0 has zero lag, with higher depth, there are large lag
> values (as observed 606.338267 onwards)

These values of se->vlag mean 'run this entity to parity' to avoid
excess context switches, which is what RUN_TO_PARITY does, or nothing
when !RUN_TO_PARITY. In short, se->vlag is not vlag when se->on_rq.

>
> Now the following occurs, triggered by the vhost:
> - The kworker gets placed again with:
> vruntime deadline
> cgroup 56117619190 57650477291 -> depth 0, last known value
> kworker 56117885776 56120885776 -> lag of -725
> - vhost continues executing and updates its vruntime accordingly, here
> I would need to enhance the trace to also print the vruntimes of the
> parent sched_entities to see the progress of their vruntime/deadline/lag
> values as well
> - It is a bit irritating that the EEVDF algorithm would not pick the kworker
> over the cgroup as its deadline is smaller.
> But, the kworker has negative lag, which might cause EEVDF to not pick
> the kworker.
> The cgroup at depth 0 has no lag, all deeper layers have a significantly
> positive lag (last known values, might have changed in the meantime).
> At this point I would see the option that the vhost task is stuck
> somewhere or EEVDF just does not see the kworker as an eligible option.

IMHO such a lag should not introduce that long a delay. Can you run the
test again with NEXT_BUDDY disabled?
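
For reference, the eligibility rule under discussion can be sketched in user space as follows. This is a minimal model under assumptions: the struct and function names are hypothetical, the currently running entity is assumed to be already folded into the sums, and the comparison avoids the division in the same spirit as the kernel check (an entity is eligible when its vruntime does not exceed the weighted average, i.e. when its lag is not negative).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the relevant cfs_rq fields. */
struct rq_state {
	uint64_t min_vruntime;
	int64_t  avg_vruntime;	/* \Sum w_i * (v_i - min_vruntime) */
	int64_t  avg_load;	/* \Sum w_i */
};

/*
 * Eligible iff v <= V, tested without dividing:
 *   \Sum w_i * (v_i - min) >= (v - min) * \Sum w_i
 */
static bool model_entity_eligible(const struct rq_state *rq, uint64_t vruntime)
{
	int64_t key = (int64_t)(vruntime - rq->min_vruntime);

	return rq->avg_vruntime >= key * rq->avg_load;
}

static void demo_eligible(void)
{
	/* weights 1024 @ v=1100 and 2048 @ v=1010 give V = 1040 */
	struct rq_state rq = { 1000, 1024 * 100 + 2048 * 10, 3072 };

	assert(model_entity_eligible(&rq, 1040));	/* zero lag */
	assert(!model_entity_eligible(&rq, 1041));	/* slightly negative lag */
	assert(model_entity_eligible(&rq, 990));	/* positive lag */
}
```

In this model even one unit of negative lag makes an entity ineligible, but only until the average catches up by that amount, which is why a small negative lag on its own would not be expected to explain a delay of tens of milliseconds.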

>
> - Once the vhost is migrated off the cpu, the update_entity_lag function
> works with the following values at 606.467022: sched_update
> For the cgroup at depth 0
> - vruntime = 57104166665 --> this is in line with the amount of new timeslices
> vhost got assigned while the kworker was waiting
> - vlag = -62439022 --> the scheduler knows that the cgroup was
> overconsuming, but no runtime for the kworker
> For the cfs_rq we have
> - min_vruntime = 56117885776 --> this matches the vruntime of the kworker
> - avg_vruntime = 161750065796 --> this is rather large in comparison, but I
> might access this value at a bad time

Use avg_vruntime() instead.

> - nr_running = 2 --> at this point, both, cgroup and kworker are
> still on the queue, with the cgroup being
> in the migration process
> --> It seems like the overconsumption accumulates at cgroup depth 0 and is not
> propagated downwards. This might be intended though.
>
> - At 606.479979: sched_place, cgroup hosting the vhost is migrated back
> onto cpu 13 with a lag of -166821875 it gets scheduled right away as
> there is no other task (nr_running = 0)
>
> - At 606.479996: sched_place, the kworker gets placed again, this time
> with no lag and get scheduled almost immediately, with a wait
> time of 1255 ns.
>
> It should be noted that these scenarios also occur when the first placement
> of the kworker in this sequence has no lag, i.e. a lag <= 0 is the pattern
> when observing this issue.
>
> ######################### full trace #########################
>
> sched_bestvnode: v=vruntime,d=deadline,l=vlag,md=min_deadline,dp=depth
> --> during __pick_eevdf, prints values for best and the first node loop variable, second loop is never executed
>
> sched_place/sched_update: sev=se->vruntime,sed=se->deadline,sel=se->vlag,avg=cfs_rq->avg_vruntime,min=cfs_rq->min_vruntime

It would be better to replace cfs_rq->avg_vruntime with avg_vruntime().
Although we can get the real @avg from (vruntime + vlag), I am not sure
whether vlag (@lag in the trace) is se->vlag or the local variable in the
place function, which is scaled and is no longer the true vlag.
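
To make the distinction concrete: in the 6.6-era code, as far as can be told from fair.c, cfs_rq->avg_vruntime holds a weighted sum of vruntimes relative to min_vruntime rather than an absolute average, which would explain why the raw value in the trace looks out of scale. The user-space sketch below models what avg_vruntime() derives from it. The struct and function names are made up for illustration, and the contribution of the currently running entity, which the kernel adds separately, is left out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the relevant cfs_rq fields. */
struct rq_sums {
	uint64_t min_vruntime;
	int64_t  avg_vruntime;	/* \Sum w_i * (v_i - min_vruntime) */
	int64_t  avg_load;	/* \Sum w_i */
};

/* V = min_vruntime + (\Sum w_i * (v_i - min_vruntime)) / \Sum w_i */
static uint64_t model_avg_vruntime(const struct rq_sums *rq)
{
	int64_t avg = rq->avg_vruntime;
	int64_t load = rq->avg_load;

	if (load) {
		/* round towards minus infinity for negative sums */
		if (avg < 0)
			avg -= (load - 1);
		avg /= load;
	}

	return rq->min_vruntime + avg;
}

static void demo_avg_vruntime(void)
{
	/* weights 1024 @ v=1100 and 2048 @ v=1010, min_vruntime = 1000 */
	struct rq_sums rq = { 1000, 1024 * 100 + 2048 * 10, 3072 };

	assert(model_avg_vruntime(&rq) == 1040);
}
```

The raw sum here is 122880 while the actual average is 1040, so printing the field directly in a trace event gives a number that cannot be compared against vruntime values.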

> --> at the end of place_entity and update_entity_lag
>
> --> the chunks of 5 entries for these new events represent the 5 levels of the cgroup which hosts the vhost
>
> vhost-2931-2953 [013] d.... 606.338262: sched_stat_blocked: comm=kworker/13:1 pid=168 delay=90133345 [ns]
> vhost-2931-2953 [013] d.... 606.338262: sched_bestvnode: best: id=0 v=56117619190 d=57650477291 l=0 md=56121178745 dp=0 node: id=168 v=56117619190 d=56120619190 l=0 md=56120619190 dp=0
> vhost-2931-2953 [013] dN... 606.338263: sched_wakeup: comm=kworker/13:1 pid=168 prio=120 target_cpu=013
> vhost-2931-2953 [013] dN... 606.338263: sched_bestvnode: best: id=0 v=56117619190 d=57650477291 l=0 md=56121178745 dp=0 node: id=168 v=56117619190 d=56120619190 l=0 md=56120619190 dp=0
> vhost-2931-2953 [013] dN... 606.338263: sched_stat_runtime: comm=vhost-2931 pid=2953 runtime=17910 [ns] vruntime=2099190650 [ns] deadline=2102172740 [ns] lag=2102172740
> vhost-2931-2953 [013] dN... 606.338264: sched_stat_wait: comm=kworker/13:1 pid=168 delay=0 [ns]
> vhost-2931-2953 [013] d.... 606.338264: sched_switch: prev_comm=vhost-2931 prev_pid=2953 prev_prio=120 prev_state=R+ ==> next_comm=kworker/13:1 next_pid=168 next_prio=120
> --> kworker allowed to execute
> kworker/13:1-168 [013] d.... 606.338266: sched_waking: comm=CPU 0/KVM pid=2958 prio=120 target_cpu=009
> kworker/13:1-168 [013] d.... 606.338267: sched_stat_runtime: comm=kworker/13:1 pid=168 runtime=4941 [ns] vruntime=56117624131 [ns] deadline=56120619190 [ns] lag=56120619190
> --> runtime of 4941 ns
> kworker/13:1-168 [013] d.... 606.338267: sched_update: comm=kworker/13:1 pid=168 sev=56117624131 sed=56120619190 sel=-725 avg=0 min=56117619190 cpu=13 nr=2 lag=-725 lim=10000000
> kworker/13:1-168 [013] d.... 606.338267: sched_bestvnode: best: id=0 v=0 d=0 l=0 md=0 dp=131072 node: id=0 v=56117619190 d=57650477291 l=0 md=57650477291 dp=0
> --> depth 0 of cgroup holding vhost:
>              vruntime       deadline
>     cgroup   56117619190    57650477291
>     kworker  56117624131    56120619190
> kworker/13:1-168 [013] d.... 606.338268: sched_bestvnode: best: id=0 v=0 d=0 l=0 md=0 dp=131072 node: id=0 v=29822481776 d=29834647752 l=29834647752 md=29834647752 dp=1
> kworker/13:1-168 [013] d.... 606.338268: sched_bestvnode: best: id=0 v=0 d=0 l=0 md=0 dp=131072 node: id=0 v=21909608438 d=21919458955 l=21919458955 md=21919458955 dp=2
> kworker/13:1-168 [013] d.... 606.338268: sched_bestvnode: best: id=0 v=0 d=0 l=0 md=0 dp=131072 node: id=0 v=11306038504 d=11312426915 l=11312426915 md=11312426915 dp=3
> kworker/13:1-168 [013] d.... 606.338268: sched_bestvnode: best: id=0 v=0 d=0 l=0 md=0 dp=131072 node: id=2953 v=2099190650 d=2102172740 l=2102172740 md=2102172740 dp=4
> kworker/13:1-168 [013] d.... 606.338268: sched_stat_wait: comm=vhost-2931 pid=2953 delay=4941 [ns]
> kworker/13:1-168 [013] d.... 606.338269: sched_switch: prev_comm=kworker/13:1 prev_pid=168 prev_prio=120 prev_state=I ==> next_comm=vhost-2931 next_pid=2953 next_prio=120
> vhost-2931-2953 [013] d.... 606.338311: sched_waking: comm=kworker/13:1 pid=168 prio=120 target_cpu=013
> vhost-2931-2953 [013] d.... 606.338312: sched_place: comm=kworker/13:1 pid=168 sev=56117885776 sed=56120885776 sel=-725 avg=0 min=56117880833 cpu=13 nr=1 vru=56117880833 lag=-725
> --> kworker gets placed again
> vhost-2931-2953 [013] d.... 606.338312: sched_stat_blocked: comm=kworker/13:1 pid=168 delay=44970 [ns]
> vhost-2931-2953 [013] d.... 606.338313: sched_wakeup: comm=kworker/13:1 pid=168 prio=120 target_cpu=013
> --> kworker set to runnable, but vhost keeps on executing

What are the weights of the two entities?

> vhost-2931-2953 [013] d.h.. 606.346964: sched_stat_runtime: comm=vhost-2931 pid=2953 runtime=8697702 [ns] vruntime=2107888352 [ns] deadline=2110888352 [ns] lag=2102172740
> vhost-2931-2953 [013] d.h.. 606.356964: sched_stat_runtime: comm=vhost-2931 pid=2953 runtime=9999583 [ns] vruntime=2117887935 [ns] deadline=2120887935 [ns] lag=2102172740
> vhost-2931-2953 [013] d.h.. 606.366964: sched_stat_runtime: comm=vhost-2931 pid=2953 runtime=10000089 [ns] vruntime=2127888024 [ns] deadline=2130888024 [ns] lag=2102172740
> vhost-2931-2953 [013] d.h.. 606.376964: sched_stat_runtime: comm=vhost-2931 pid=2953 runtime=9999716 [ns] vruntime=2137887740 [ns] deadline=2140887740 [ns] lag=2102172740
> vhost-2931-2953 [013] d.h.. 606.386964: sched_stat_runtime: comm=vhost-2931 pid=2953 runtime=10000179 [ns] vruntime=2147887919 [ns] deadline=2150887919 [ns] lag=2102172740
> vhost-2931-2953 [013] D.... 606.392250: sched_waking: comm=vhost-2306 pid=2324 prio=120 target_cpu=018
> vhost-2931-2953 [013] D.... 606.392388: sched_waking: comm=vhost-2306 pid=2321 prio=120 target_cpu=017
> vhost-2931-2953 [013] D.... 606.392390: sched_migrate_task: comm=vhost-2306 pid=2321 prio=120 orig_cpu=17 dest_cpu=23
> vhost-2931-2953 [013] d.h.. 606.396964: sched_stat_runtime: comm=vhost-2931 pid=2953 runtime=10000187 [ns] vruntime=2157888106 [ns] deadline=2160888106 [ns] lag=2102172740
> vhost-2931-2953 [013] d.h.. 606.406964: sched_stat_runtime: comm=vhost-2931 pid=2953 runtime=10000112 [ns] vruntime=2167888218 [ns] deadline=2170888218 [ns] lag=2102172740
> vhost-2931-2953 [013] d.h.. 606.416964: sched_stat_runtime: comm=vhost-2931 pid=2953 runtime=9999779 [ns] vruntime=2177887997 [ns] deadline=2180887997 [ns] lag=2102172740
> vhost-2931-2953 [013] d.h.. 606.426964: sched_stat_runtime: comm=vhost-2931 pid=2953 runtime=9999667 [ns] vruntime=2187887664 [ns] deadline=2190887664 [ns] lag=2102172740
> vhost-2931-2953 [013] d.h.. 606.436964: sched_stat_runtime: comm=vhost-2931 pid=2953 runtime=10000329 [ns] vruntime=2197887993 [ns] deadline=2200887993 [ns] lag=2102172740
> vhost-2931-2953 [013] D.... 606.441980: sched_waking: comm=vhost-2306 pid=2325 prio=120 target_cpu=021
> vhost-2931-2953 [013] d.h.. 606.446964: sched_stat_runtime: comm=vhost-2931 pid=2953 runtime=10000069 [ns] vruntime=2207888062 [ns] deadline=2210888062 [ns] lag=2102172740
> vhost-2931-2953 [013] d.h.. 606.456964: sched_stat_runtime: comm=vhost-2931 pid=2953 runtime=9999977 [ns] vruntime=2217888039 [ns] deadline=2220888039 [ns] lag=2102172740
> vhost-2931-2953 [013] d.h.. 606.466964: sched_stat_runtime: comm=vhost-2931 pid=2953 runtime=9999548 [ns] vruntime=2227887587 [ns] deadline=2230887587 [ns] lag=2102172740
> vhost-2931-2953 [013] dNh.. 606.466979: sched_wakeup: comm=migration/13 pid=80 prio=0 target_cpu=013
> vhost-2931-2953 [013] dN... 606.467017: sched_stat_runtime: comm=vhost-2931 pid=2953 runtime=41352 [ns] vruntime=2227928939 [ns] deadline=2230887587 [ns] lag=2102172740
> vhost-2931-2953 [013] d.... 606.467018: sched_switch: prev_comm=vhost-2931 prev_pid=2953 prev_prio=120 prev_state=R+ ==> next_comm=migration/13 next_pid=80 next_prio=0
> migration/13-80 [013] d..1. 606.467020: sched_update: comm=vhost-2931 pid=2953 sev=2227928939 sed=2230887587 sel=0 avg=0 min=2227928939 cpu=13 nr=1 lag=0 lim=10000000
> migration/13-80 [013] d..1. 606.467021: sched_update: comm= pid=0 sev=12075393889 sed=12087868931 sel=0 avg=0 min=12075393889 cpu=13 nr=1 lag=0 lim=42139916
> migration/13-80 [013] d..1. 606.467021: sched_update: comm= pid=0 sev=23017543001 sed=23036322254 sel=0 avg=0 min=23017543001 cpu=13 nr=1 lag=0 lim=63209874
> migration/13-80 [013] d..1. 606.467021: sched_update: comm= pid=0 sev=30619368612 sed=30633124735 sel=0 avg=0 min=30619368612 cpu=13 nr=1 lag=0 lim=46126124
> migration/13-80 [013] d..1. 606.467022: sched_update: comm= pid=0 sev=57104166665 sed=57945071818 sel=-62439022 avg=161750065796 min=56117885776 cpu=13 nr=2 lag=-62439022 lim=62439022
> --> depth 0 of cgroup holding vhost:
>              vruntime       deadline
>     cgroup   57104166665    57945071818
>     kworker  56117885776    56120885776   --> last known values
> --> cgroup's lag of -62439022 indicates that the scheduler knows that the cgroup ran for too long
> --> nr=2 shows that the cgroup and the kworker are currently on the runqueue
> migration/13-80 [013] d..1. 606.467022: sched_migrate_task: comm=vhost-2931 pid=2953 prio=120 orig_cpu=13 dest_cpu=12
> migration/13-80 [013] d..1. 606.467023: sched_place: comm=vhost-2931 pid=2953 sev=2994881412 sed=2997881412 sel=0 avg=0 min=2994881412 cpu=12 nr=0 vru=2994881412 lag=0
> migration/13-80 [013] d..1. 606.467023: sched_place: comm= pid=0 sev=16617220304 sed=16632657489 sel=0 avg=0 min=16617220304 cpu=12 nr=0 vru=16617220304 lag=0
> migration/13-80 [013] d..1. 606.467024: sched_place: comm= pid=0 sev=30778525102 sed=30804781512 sel=0 avg=0 min=30778525102 cpu=12 nr=0 vru=30778525102 lag=0
> migration/13-80 [013] d..1. 606.467024: sched_place: comm= pid=0 sev=38704326194 sed=38724404624 sel=0 avg=0 min=38704326194 cpu=12 nr=0 vru=38704326194 lag=0
> migration/13-80 [013] d..1. 606.467025: sched_place: comm= pid=0 sev=66383057731 sed=66409091628 sel=-30739032 avg=0 min=66383057731 cpu=12 nr=0 vru=66383057731 lag=0
> --> vhost migrated off to CPU 12
> migration/13-80 [013] d.... 606.467026: sched_bestvnode: best: id=0 v=0 d=0 l=0 md=0 dp=131072 node: id=168 v=56117885776 d=56120885776 l=-725 md=56120885776 dp=0
> migration/13-80 [013] d.... 606.467026: sched_stat_wait: comm=kworker/13:1 pid=168 delay=128714004 [ns]
> migration/13-80 [013] d.... 606.467027: sched_switch: prev_comm=migration/13 prev_pid=80 prev_prio=0 prev_state=S ==> next_comm=kworker/13:1 next_pid=168 next_prio=120
> --> kworker runs next
> kworker/13:1-168 [013] d.... 606.467030: sched_waking: comm=CPU 0/KVM pid=2958 prio=120 target_cpu=009
> kworker/13:1-168 [013] d.... 606.467032: sched_stat_runtime: comm=kworker/13:1 pid=168 runtime=6163 [ns] vruntime=56117891939 [ns] deadline=56120885776 [ns] lag=56120885776
> kworker/13:1-168 [013] d.... 606.467032: sched_update: comm=kworker/13:1 pid=168 sev=56117891939 sed=56120885776 sel=0 avg=0 min=56117891939 cpu=13 nr=1 lag=0 lim=10000000
> kworker/13:1-168 [013] d.... 606.467033: sched_switch: prev_comm=kworker/13:1 prev_pid=168 prev_prio=120 prev_state=I ==> next_comm=swapper/13 next_pid=0 next_prio=120
> --> kworker finishes
> <idle>-0 [013] d.h.. 606.479977: sched_place: comm=vhost-2931 pid=2953 sev=2227928939 sed=2230928939 sel=0 avg=0 min=2227928939 cpu=13 nr=0 vru=2227928939 lag=0
> --> vhost migrated back and placed on CPU 13 again
> <idle>-0 [013] d.h.. 606.479977: sched_stat_sleep: comm=vhost-2931 pid=2953 delay=27874 [ns]
> <idle>-0 [013] d.h.. 606.479977: sched_place: comm= pid=0 sev=12075393889 sed=12099393888 sel=0 avg=0 min=12075393889 cpu=13 nr=0 vru=12075393889 lag=0
> <idle>-0 [013] d.h.. 606.479978: sched_place: comm= pid=0 sev=23017543001 sed=23056927616 sel=0 avg=0 min=23017543001 cpu=13 nr=0 vru=23017543001 lag=0
> <idle>-0 [013] d.h.. 606.479978: sched_place: comm= pid=0 sev=30619368612 sed=30648907073 sel=0 avg=0 min=30619368612 cpu=13 nr=0 vru=30619368612 lag=0
> <idle>-0 [013] d.h.. 606.479979: sched_place: comm= pid=0 sev=56117891939 sed=56168252594 sel=-166821875 avg=0 min=56117891939 cpu=13 nr=0 vru=56117891939 lag=0
> <idle>-0 [013] dNh.. 606.479979: sched_wakeup: comm=vhost-2931 pid=2953 prio=120 target_cpu=013
> <idle>-0 [013] dN... 606.479981: sched_bestvnode: best: id=0 v=0 d=0 l=0 md=0 dp=131072 node: id=0 v=56117891939 d=56168252594 l=-166821875 md=56168252594 dp=0
> --> depth 0 of cgroup holding vhost:
>              vruntime       deadline
>     cgroup   56117891939    56168252594
>     kworker  56117891939    56120885776
> <idle>-0 [013] dN... 606.479981: sched_bestvnode: best: id=0 v=0 d=0 l=0 md=0 dp=131072 node: id=0 v=30619368612 d=30648907073 l=0 md=30648907073 dp=1
> <idle>-0 [013] dN... 606.479981: sched_bestvnode: best: id=0 v=0 d=0 l=0 md=0 dp=131072 node: id=0 v=23017543001 d=23056927616 l=0 md=23056927616 dp=2
> <idle>-0 [013] dN... 606.479981: sched_bestvnode: best: id=0 v=0 d=0 l=0 md=0 dp=131072 node: id=0 v=12075393889 d=12099393888 l=0 md=12099393888 dp=3
> <idle>-0 [013] dN... 606.479981: sched_bestvnode: best: id=0 v=0 d=0 l=0 md=0 dp=131072 node: id=2953 v=2227928939 d=2230928939 l=0 md=2230928939 dp=4
> <idle>-0 [013] dN... 606.479982: sched_stat_wait: comm=vhost-2931 pid=2953 delay=0 [ns]
> <idle>-0 [013] d.... 606.479982: sched_switch: prev_comm=swapper/13 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=vhost-2931 next_pid=2953 next_prio=120
> --> vhost can continue to bully the kworker
> vhost-2931-2953 [013] d.... 606.479995: sched_waking: comm=kworker/13:1 pid=168 prio=120 target_cpu=013
> vhost-2931-2953 [013] d.... 606.479996: sched_place: comm=kworker/13:1 pid=168 sev=56118220659 sed=56121220659 sel=0 avg=0 min=56118220659 cpu=13 nr=1 vru=56118220659 lag=0
> vhost-2931-2953 [013] d.... 606.479996: sched_stat_blocked: comm=kworker/13:1 pid=168 delay=12964004 [ns]
> vhost-2931-2953 [013] d.... 606.479997: sched_wakeup: comm=kworker/13:1 pid=168 prio=120 target_cpu=013
> vhost-2931-2953 [013] d.... 606.479997: sched_stat_runtime: comm=vhost-2931 pid=2953 runtime=20837 [ns] vruntime=2227949776 [ns] deadline=2230928939 [ns] lag=2230928939
> vhost-2931-2953 [013] d.... 606.479997: sched_update: comm=vhost-2931 pid=2953 sev=2227949776 sed=2230928939 sel=0 avg=0 min=2227949776 cpu=13 nr=1 lag=0 lim=10000000
> vhost-2931-2953 [013] d.... 606.479998: sched_update: comm= pid=0 sev=12075560584 sed=12099393888 sel=0 avg=0 min=12075560584 cpu=13 nr=1 lag=0 lim=79999997
> vhost-2931-2953 [013] d.... 606.479998: sched_update: comm= pid=0 sev=23017816553 sed=23056927616 sel=0 avg=0 min=23017816553 cpu=13 nr=1 lag=0 lim=131282050
> vhost-2931-2953 [013] d.... 606.479998: sched_update: comm= pid=0 sev=30619573776 sed=30648907073 sel=0 avg=0 min=30619573776 cpu=13 nr=1 lag=0 lim=98461537
> vhost-2931-2953 [013] d.... 606.479998: sched_update: comm= pid=0 sev=56118241726 sed=56168252594 sel=-19883 avg=0 min=56118220659 cpu=13 nr=2 lag=-19883 lim=167868850
> vhost-2931-2953 [013] d.... 606.479999: sched_bestvnode: best: id=0 v=0 d=0 l=0 md=0 dp=131072 node: id=168 v=56118220659 d=56121220659 l=0 md=56121220659 dp=0
> vhost-2931-2953 [013] d.... 606.479999: sched_stat_wait: comm=kworker/13:1 pid=168 delay=1255 [ns]
> --> good delay of 1255 ns for the kworker
> --> depth 0 of cgroup holding vhost:
>              vruntime       deadline
>     cgroup   56118241726    56168252594
>     kworker  56118220659    56121220659
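
For anyone who wants to pull the wait delays out of such a trace without
reading it line by line, here is a minimal sketch. It assumes the
sched_stat_wait line format shown in the excerpt above; the helper names
are hypothetical and the sample lines are copied from the quoted trace.

```python
import re

# Matches e.g. "606.467026: sched_stat_wait: comm=kworker/13:1 pid=168 delay=128714004 [ns]"
WAIT_RE = re.compile(
    r"(?P<ts>\d+\.\d+): sched_stat_wait: "
    r"comm=(?P<comm>.+?) pid=(?P<pid>\d+) delay=(?P<delay>\d+) \[ns\]"
)


def wait_delays(lines):
    """Yield (timestamp, comm, pid, delay_ns) for every sched_stat_wait event."""
    for line in lines:
        m = WAIT_RE.search(line)
        if m:
            yield float(m["ts"]), m["comm"], int(m["pid"]), int(m["delay"])


def worst_wait(lines):
    """Return the event with the largest delay, or None if there is none."""
    return max(wait_delays(lines), key=lambda event: event[3], default=None)


SAMPLE = [
    "> vhost-2931-2953 [013] dN... 606.338264: sched_stat_wait: comm=kworker/13:1 pid=168 delay=0 [ns]",
    "> kworker/13:1-168 [013] d.... 606.338268: sched_stat_wait: comm=vhost-2931 pid=2953 delay=4941 [ns]",
    "> migration/13-80 [013] d.... 606.467026: sched_stat_wait: comm=kworker/13:1 pid=168 delay=128714004 [ns]",
    "> vhost-2931-2953 [013] d.... 606.479999: sched_stat_wait: comm=kworker/13:1 pid=168 delay=1255 [ns]",
]

ts, comm, pid, delay_ns = worst_wait(SAMPLE)
print(f"{comm} (pid {pid}) waited {delay_ns / 1e6:.3f} ms, picked at {ts}")
```

On the sample lines this singles out the kworker's 128714004 ns wait, which
is consistent with the gap between its sched_wakeup at 606.338313 and the
point where it is finally picked at 606.467026.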

2023-12-07 06:49:24