The previously used CFS scheduler gave woken-up tasks an enhanced
chance to see runtime immediately by deducting a certain value from
their vruntime on runqueue placement during wakeup.
This property was relied upon by some users, at least vhost, to ensure
that certain kworkers are scheduled immediately after being woken up.
The EEVDF scheduler does not support this so far. Instead, if such a
woken-up entity carries a negative lag from its previous execution, it
has to wait for the current time slice to finish, which negatively
affects the performance of the process expecting the immediate
execution.
To address this issue, implement EEVDF strategy #2 for rejoining
entities, which dismisses the lag from previous execution and allows
the woken-up task to run immediately (if no other entities are deemed
to be preferred for scheduling by EEVDF).
The vruntime is decremented by an additional value of 1 to make sure
that the woken-up task actually gets to run. This does not follow
strategy #2 exactly, but it guarantees the expected behavior for the
scenario described above. Without the additional decrement, the
performance degrades even further, so there are side effects I could
not get my head around yet.
Questions:
1. The kworker getting its negative lag occurs in the following scenario
- kworker and a cgroup are supposed to execute on the same CPU
- one task within the cgroup is executing and wakes up the kworker
- kworker with 0 lag, gets picked immediately and finishes its
execution within ~5000ns
- on dequeue, kworker gets assigned a negative lag
Is this expected behavior? With this short execution time, I would
expect the kworker to be fine.
For a more detailed discussion on this symptom, please see:
https://lore.kernel.org/netdev/ZWbapeL34Z8AMR5f@DESKTOP-2CCOB1S./T/
2. The proposed code change of course only addresses the symptom. Am I
   assuming correctly that this is in general the expected behavior and
   that the task waking up the kworker should rather do an explicit
   reschedule of itself to grant the kworker time to execute?
   In the vhost case, this is currently attempted through a cond_resched
   which does nothing because the need_resched flag is not set.
Feedback and opinions would be highly appreciated.
Signed-off-by: Tobias Huschle <[email protected]>
---
kernel/sched/fair.c | 5 +++++
kernel/sched/features.h | 1 +
2 files changed, 6 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 533547e3c90a..c20ae6d62961 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5239,6 +5239,11 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
lag = div_s64(lag, load);
}
+ if (sched_feat(NOLAG_WAKEUP) && (flags & ENQUEUE_WAKEUP)) {
+ se->vlag = 0;
+ lag = 1;
+ }
+
se->vruntime = vruntime - lag;
/*
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 143f55df890b..d3118e7568b4 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -7,6 +7,7 @@
SCHED_FEAT(PLACE_LAG, true)
SCHED_FEAT(PLACE_DEADLINE_INITIAL, true)
SCHED_FEAT(RUN_TO_PARITY, true)
+SCHED_FEAT(NOLAG_WAKEUP, true)
/*
* Prefer to schedule the task we woke last (assuming it failed
--
2.34.1
(+ Xuewen Yan, Ke Wang)
Hello Tobias,
On 2/28/2024 9:40 PM, Tobias Huschle wrote:
> The previously used CFS scheduler gave tasks that were woken up an
> enhanced chance to see runtime immediately by deducting a certain value
> from its vruntime on runqueue placement during wakeup.
>
> This property was used by some, at least vhost, to ensure, that certain
> kworkers are scheduled immediately after being woken up. The EEVDF
> scheduler, does not support this so far. Instead, if such a woken up
> entitiy carries a negative lag from its previous execution, it will have
> to wait for the current time slice to finish, which affects the
> performance of the process expecting the immediate execution negatively.
>
> To address this issue, implement EEVDF strategy #2 for rejoining
> entities, which dismisses the lag from previous execution and allows
> the woken up task to run immediately (if no other entities are deemed
> to be preferred for scheduling by EEVDF).
>
> The vruntime is decremented by an additional value of 1 to make sure,
> that the woken up tasks gets to actually run. This is of course not
> following strategy #2 in an exact manner but guarantees the expected
> behavior for the scenario described above. Without the additional
> decrement, the performance goes south even more. So there are some
> side effects I could not get my head around yet.
>
> Questions:
> 1. The kworker getting its negative lag occurs in the following scenario
> - kworker and a cgroup are supposed to execute on the same CPU
> - one task within the cgroup is executing and wakes up the kworker
> - kworker with 0 lag, gets picked immediately and finishes its
> execution within ~5000ns
> - on dequeue, kworker gets assigned a negative lag
> Is this expected behavior? With this short execution time, I would
> expect the kworker to be fine.
> For a more detailed discussion on this symptom, please see:
> https://lore.kernel.org/netdev/ZWbapeL34Z8AMR5f@DESKTOP-2CCOB1S./T/
Does the lag clamping path from Xuewen Yan [1] work for the vhost case
mentioned in the thread? Instead of placing the task just behind the
0-lag point, clamping the lag seems to be a more principled approach since
EEVDF already does it in update_entity_lag().
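For reference, the clamping I mean in update_entity_lag() looks roughly
like this (paraphrasing the current kernel/sched/fair.c; the exact bounds
may differ between versions):

static void update_entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
	s64 lag, limit;

	SCHED_WARN_ON(!se->on_rq);
	lag = avg_vruntime(cfs_rq) - se->vruntime;

	/* the lag remembered for the next placement is bounded to ~2 slices */
	limit = calc_delta_fair(max_t(u64, 2*se->slice, TICK_NSEC), se);
	se->vlag = clamp(lag, -limit, limit);
}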
If the lag is still too large, maybe the above coupled with Peter's
delayed dequeue patch can help [2] (Note: tree is prone to force
updates)
[1] https://lore.kernel.org/lkml/[email protected]/
[2] https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/commit/?h=sched/eevdf&id=e62ef63a888c97188a977daddb72b61548da8417
> 2. The proposed code change of course only addresses the symptom. Am I
> assuming correctly that this is in general the exepected behavior and
> that the task waking up the kworker should rather do an explicit
> reschedule of itself to grant the kworker time to execute?
> In the vhost case, this is currently attempted through a cond_resched
> which is not doing anything because the need_resched flag is not set.
>
> Feedback and opinions would be highly appreciated.
>
> Signed-off-by: Tobias Huschle <[email protected]>
> ---
> kernel/sched/fair.c | 5 +++++
> kernel/sched/features.h | 1 +
> 2 files changed, 6 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 533547e3c90a..c20ae6d62961 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5239,6 +5239,11 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> lag = div_s64(lag, load);
> }
>
> + if (sched_feat(NOLAG_WAKEUP) && (flags & ENQUEUE_WAKEUP)) {
> + se->vlag = 0;
> + lag = 1;
> + }
> +
> se->vruntime = vruntime - lag;
>
> /*
> diff --git a/kernel/sched/features.h b/kernel/sched/features.h
> index 143f55df890b..d3118e7568b4 100644
> --- a/kernel/sched/features.h
> +++ b/kernel/sched/features.h
> @@ -7,6 +7,7 @@
> SCHED_FEAT(PLACE_LAG, true)
> SCHED_FEAT(PLACE_DEADLINE_INITIAL, true)
> SCHED_FEAT(RUN_TO_PARITY, true)
> +SCHED_FEAT(NOLAG_WAKEUP, true)
>
> /*
> * Prefer to schedule the task we woke last (assuming it failed
--
Thanks and Regards,
Prateek
On Thu, Feb 29, 2024 at 09:06:16AM +0530, K Prateek Nayak wrote:
> (+ Xuewen Yan, Ke Wang)
>
> Hello Tobias,
>
<...>
> >
> > Questions:
> > 1. The kworker getting its negative lag occurs in the following scenario
> > - kworker and a cgroup are supposed to execute on the same CPU
> > - one task within the cgroup is executing and wakes up the kworker
> > - kworker with 0 lag, gets picked immediately and finishes its
> > execution within ~5000ns
> > - on dequeue, kworker gets assigned a negative lag
> > Is this expected behavior? With this short execution time, I would
> > expect the kworker to be fine.
> > For a more detailed discussion on this symptom, please see:
> > https://lore.kernel.org/netdev/ZWbapeL34Z8AMR5f@DESKTOP-2CCOB1S./T/
>
> Does the lag clamping path from Xuewen Yan [1] work for the vhost case
> mentioned in the thread? Instead of placing the task just behind the
> 0-lag point, clamping the lag seems to be more principled approach since
> EEVDF already does it in update_entity_lag().
>
> If the lag is still too large, maybe the above coupled with Peter's
> delayed dequeue patch can help [2] (Note: tree is prone to force
> updates)
>
> [1] https://lore.kernel.org/lkml/[email protected]/
> [2] https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/commit/?h=sched/eevdf&id=e62ef63a888c97188a977daddb72b61548da8417
I tried Peter's patches a while ago. Unfortunately, reducing the lag
is not sufficient in this particular case. The calling entity expects
the woken up kworker to run instantly.
In order to have a chance that the woken-up kworker is scheduled right
away, the kworker must not carry any negative lag. To guarantee that it
gets scheduled, it would even need a positive lag large enough to let it
pass all other entities on the queue.
Therefore I proposed to just wipe the negative lag in these cases,
which seems to map to strategy #2 of the underlying paper.
The other way to think about this would be:
   The assumption that woken-up tasks get a high probability to run
   is no longer valid. In that case, the entity triggering the wake-up
   has to explicitly give up the CPU. If there are no other tasks apart
   from the two involved so far, the kworker then has good chances of
   being scheduled. If the runqueue is busy, other tasks might intervene.
I keep playing around with these options, but potential side effects
are worrying me.
>
<...>
Hi Tobias,
On 2/28/24 16:10, Tobias Huschle wrote:
> The previously used CFS scheduler gave tasks that were woken up an
> enhanced chance to see runtime immediately by deducting a certain value
> from its vruntime on runqueue placement during wakeup.
>
> This property was used by some, at least vhost, to ensure, that certain
> kworkers are scheduled immediately after being woken up. The EEVDF
> scheduler, does not support this so far. Instead, if such a woken up
> entitiy carries a negative lag from its previous execution, it will have
> to wait for the current time slice to finish, which affects the
> performance of the process expecting the immediate execution negatively.
>
> To address this issue, implement EEVDF strategy #2 for rejoining
> entities, which dismisses the lag from previous execution and allows
> the woken up task to run immediately (if no other entities are deemed
> to be preferred for scheduling by EEVDF).
>
> The vruntime is decremented by an additional value of 1 to make sure,
> that the woken up tasks gets to actually run. This is of course not
> following strategy #2 in an exact manner but guarantees the expected
> behavior for the scenario described above. Without the additional
> decrement, the performance goes south even more. So there are some
> side effects I could not get my head around yet.
>
> Questions:
> 1. The kworker getting its negative lag occurs in the following scenario
> - kworker and a cgroup are supposed to execute on the same CPU
> - one task within the cgroup is executing and wakes up the kworker
> - kworker with 0 lag, gets picked immediately and finishes its
> execution within ~5000ns
> - on dequeue, kworker gets assigned a negative lag
> Is this expected behavior? With this short execution time, I would
> expect the kworker to be fine.
That strikes me as a bit odd as well. Have you been able to determine how a negative lag
is assigned to the kworker after such a short runtime?
I was looking at a different thread (https://lore.kernel.org/lkml/[email protected]/) that
uncovers a potential overflow in the eligibility calculation. Though I don't think that is the case for this particular
vhost problem.
On Fri, Mar 08, 2024 at 03:11:38PM +0000, Luis Machado wrote:
> On 2/28/24 16:10, Tobias Huschle wrote:
> >
> > Questions:
> > 1. The kworker getting its negative lag occurs in the following scenario
> > - kworker and a cgroup are supposed to execute on the same CPU
> > - one task within the cgroup is executing and wakes up the kworker
> > - kworker with 0 lag, gets picked immediately and finishes its
> > execution within ~5000ns
> > - on dequeue, kworker gets assigned a negative lag
> > Is this expected behavior? With this short execution time, I would
> > expect the kworker to be fine.
>
> That strikes me as a bit odd as well. Have you been able to determine how a negative lag
> is assigned to the kworker after such a short runtime?
>
I did some more trace reading though and found something.
What I observed if everything runs regularly:
- vhost and kworker run alternating on the same CPU
- if the kworker is done, it leaves the runqueue
- vhost wakes up the kworker if it needs it
--> this means:
- vhost starts alone on an otherwise empty runqueue
- it seems like it never gets dequeued
(unless another unrelated task joins or migration hits)
- if vhost wakes up the kworker, the kworker gets selected
- vhost runtime > kworker runtime
--> kworker gets positive lag and gets selected immediately next time
What happens if it does go wrong:
From what I gather, there seem to be occasions where the vhost either
executes surprisingly quickly, or the kworker surprisingly slowly. If these
outliers reach critical values, it can happen that
   vhost runtime < kworker runtime
which now causes the kworker to get a negative lag.
In this case it seems that the vhost is very fast in waking up
the kworker, and coincidentally the kworker takes more time than usual
to finish. We are talking about 4-digit to low 5-digit nanoseconds.
So, for these outliers, the scheduler extrapolates that the kworker
out-consumes the vhost and should be slowed down, although in the majority
of other cases this does not happen.
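As a toy illustration of why the sign flips (a user-space sketch with
made-up numbers; it assumes equal weights and that the kworker starts the
exchange at the 0-lag point, the real arithmetic in fair.c is of course
more involved):

#include <stdio.h>

/*
 * With only vhost and the kworker queued and equal weights, avg_vruntime
 * is the mean of the two vruntimes, so at kworker dequeue
 *     lag ~= (vhost_delta - kworker_delta) / 2
 * i.e. the sign of the lag is the sign of the runtime difference of that
 * one exchange.
 */
static long long dequeue_lag(long long vhost_delta, long long kworker_delta)
{
	long long vhost_vr   = vhost_delta;   /* vruntime advance of vhost   */
	long long kworker_vr = kworker_delta; /* vruntime advance of kworker */
	long long avg        = (vhost_vr + kworker_vr) / 2;

	return avg - kworker_vr;              /* what ends up in se->vlag */
}

int main(void)
{
	/* regular case: vhost consumes more runtime than the kworker */
	printf("lag = %lld\n", dequeue_lag(9000, 5000));   /* +2000 */
	/* outlier: the kworker is unusually slow for one exchange */
	printf("lag = %lld\n", dequeue_lag(4000, 12000));  /* -4000 */
	return 0;
}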
Therefore this particular use case would profit from being able to ignore
such outliers, or from being able to ignore a certain amount of difference
in the lag values, i.e. introduce some grace value around the average runtime
for which lag is not accounted. But I am not sure I like that idea.
So the negative lag can be somewhat justified, but for this particular case
it leads to a problem where one outlier can cause havoc. As mentioned in the
vhost discussion, it could also be argued that the vhost should not rely on
the fact that the kworker gets always scheduled on wake up, since these
timing issues can always happen.
Hence, the two options:
- offer the alternative strategy which dismisses lag on wake-up for workloads
  where we know that a task usually finishes faster than others but should
  not be punished by rare outliers (if that is predictable, I don't know)
- require vhost to address this issue on their side (if possible without
  creating an armada of side effects)
(plus the third one mentioned above, but that requires a magic cutoff value, meh)
> I was looking at a different thread (https://lore.kernel.org/lkml/[email protected]/) that
> uncovers a potential overflow in the eligibility calculation. Though I don't think that is the case for this particular
> vhost problem.
Yea, the numbers I see do not look very overflowy.
On 3/14/24 13:45, Tobias Huschle wrote:
> On Fri, Mar 08, 2024 at 03:11:38PM +0000, Luis Machado wrote:
>> On 2/28/24 16:10, Tobias Huschle wrote:
>>>
>>> Questions:
>>> 1. The kworker getting its negative lag occurs in the following scenario
>>> - kworker and a cgroup are supposed to execute on the same CPU
>>> - one task within the cgroup is executing and wakes up the kworker
>>> - kworker with 0 lag, gets picked immediately and finishes its
>>> execution within ~5000ns
>>> - on dequeue, kworker gets assigned a negative lag
>>> Is this expected behavior? With this short execution time, I would
>>> expect the kworker to be fine.
>>
>> That strikes me as a bit odd as well. Have you been able to determine how a negative lag
>> is assigned to the kworker after such a short runtime?
>>
>
> I did some more trace reading though and found something.
>
> What I observed if everything runs regularly:
> - vhost and kworker run alternating on the same CPU
> - if the kworker is done, it leaves the runqueue
> - vhost wakes up the kworker if it needs it
> --> this means:
> - vhost starts alone on an otherwise empty runqueue
> - it seems like it never gets dequeued
> (unless another unrelated task joins or migration hits)
> - if vhost wakes up the kworker, the kworker gets selected
> - vhost runtime > kworker runtime
> --> kworker gets positive lag and gets selected immediately next time
>
> What happens if it does go wrong:
> From what I gather, there seem to be occasions where the vhost either
> executes suprisingly quick, or the kworker surprinsingly slow. If these
> outliers reach critical values, it can happen, that
> vhost runtime < kworker runtime
> which now causes the kworker to get the negative lag.
>
> In this case it seems like, that the vhost is very fast in waking up
> the kworker. And coincidentally, the kworker takes, more time than usual
> to finish. We speak of 4-digit to low 5-digit nanoseconds.
>
> So, for these outliers, the scheduler extrapolates that the kworker
> out-consumes the vhost and should be slowed down, although in the majority
> of other cases this does not happen.
Thanks for providing the above details, Tobias. It does seem like EEVDF is strict
about the eligibility checks, making tasks wait when their lag is negative, even
if only by a little bit, as in the case of the kworker.
There was a patch to disable the eligibility checks (https://lore.kernel.org/lkml/[email protected]/),
which would make EEVDF more like EVDF, though the deadline comparison would
probably still favor the vhost task instead of the kworker with the negative lag.
I'm not sure if you tried it, but I thought I'd mention it.
On 2024-03-18 15:45, Luis Machado wrote:
> On 3/14/24 13:45, Tobias Huschle wrote:
>> On Fri, Mar 08, 2024 at 03:11:38PM +0000, Luis Machado wrote:
>>> On 2/28/24 16:10, Tobias Huschle wrote:
>>>>
>>>> Questions:
>>>> 1. The kworker getting its negative lag occurs in the following
>>>> scenario
>>>> - kworker and a cgroup are supposed to execute on the same CPU
>>>> - one task within the cgroup is executing and wakes up the
>>>> kworker
>>>> - kworker with 0 lag, gets picked immediately and finishes its
>>>> execution within ~5000ns
>>>> - on dequeue, kworker gets assigned a negative lag
>>>> Is this expected behavior? With this short execution time, I
>>>> would
>>>> expect the kworker to be fine.
>>>
>>> That strikes me as a bit odd as well. Have you been able to determine
>>> how a negative lag
>>> is assigned to the kworker after such a short runtime?
>>>
>>
>> I did some more trace reading though and found something.
>>
>> What I observed if everything runs regularly:
>> - vhost and kworker run alternating on the same CPU
>> - if the kworker is done, it leaves the runqueue
>> - vhost wakes up the kworker if it needs it
>> --> this means:
>> - vhost starts alone on an otherwise empty runqueue
>> - it seems like it never gets dequeued
>> (unless another unrelated task joins or migration hits)
>> - if vhost wakes up the kworker, the kworker gets selected
>> - vhost runtime > kworker runtime
>> --> kworker gets positive lag and gets selected immediately next
>> time
>>
>> What happens if it does go wrong:
>> From what I gather, there seem to be occasions where the vhost either
>> executes suprisingly quick, or the kworker surprinsingly slow. If
>> these
>> outliers reach critical values, it can happen, that
>> vhost runtime < kworker runtime
>> which now causes the kworker to get the negative lag.
>>
>> In this case it seems like, that the vhost is very fast in waking up
>> the kworker. And coincidentally, the kworker takes, more time than
>> usual
>> to finish. We speak of 4-digit to low 5-digit nanoseconds.
>>
>> So, for these outliers, the scheduler extrapolates that the kworker
>> out-consumes the vhost and should be slowed down, although in the
>> majority
>> of other cases this does not happen.
>
> Thanks for providing the above details Tobias. It does seem like EEVDF
> is strict
> about the eligibility checks and making tasks wait when their lags are
> negative, even
> if just a little bit as in the case of the kworker.
>
> There was a patch to disable the eligibility checks
> (https://lore.kernel.org/lkml/[email protected]/),
> which would make EEVDF more like EVDF, though the deadline comparison
> would
> probably still favor the vhost task instead of the kworker with the
> negative lag.
>
> I'm not sure if you tried it, but I thought I'd mention it.
Haven't seen that one yet. Unfortunately, it does not help to ignore the
eligibility.
I'm rather inclined to propose a documentation change which describes
that tasks should not rely on woken-up tasks being scheduled
immediately.
Changing things in the code to address the specific scenario I'm
seeing seems to mostly create unwanted side effects and/or would require
the definition of some magic cut-off values.
On Tue, 19 Mar 2024 at 10:08, Tobias Huschle <[email protected]> wrote:
>
> On 2024-03-18 15:45, Luis Machado wrote:
> > On 3/14/24 13:45, Tobias Huschle wrote:
> >> On Fri, Mar 08, 2024 at 03:11:38PM +0000, Luis Machado wrote:
> >>> On 2/28/24 16:10, Tobias Huschle wrote:
> >>>>
> >>>> Questions:
> >>>> 1. The kworker getting its negative lag occurs in the following
> >>>> scenario
> >>>> - kworker and a cgroup are supposed to execute on the same CPU
> >>>> - one task within the cgroup is executing and wakes up the
> >>>> kworker
> >>>> - kworker with 0 lag, gets picked immediately and finishes its
> >>>> execution within ~5000ns
> >>>> - on dequeue, kworker gets assigned a negative lag
> >>>> Is this expected behavior? With this short execution time, I
> >>>> would
> >>>> expect the kworker to be fine.
> >>>
> >>> That strikes me as a bit odd as well. Have you been able to determine
> >>> how a negative lag
> >>> is assigned to the kworker after such a short runtime?
> >>>
> >>
> >> I did some more trace reading though and found something.
> >>
> >> What I observed if everything runs regularly:
> >> - vhost and kworker run alternating on the same CPU
> >> - if the kworker is done, it leaves the runqueue
> >> - vhost wakes up the kworker if it needs it
> >> --> this means:
> >> - vhost starts alone on an otherwise empty runqueue
> >> - it seems like it never gets dequeued
> >> (unless another unrelated task joins or migration hits)
> >> - if vhost wakes up the kworker, the kworker gets selected
> >> - vhost runtime > kworker runtime
> >> --> kworker gets positive lag and gets selected immediately next
> >> time
> >>
> >> What happens if it does go wrong:
> >> From what I gather, there seem to be occasions where the vhost either
> >> executes suprisingly quick, or the kworker surprinsingly slow. If
> >> these
> >> outliers reach critical values, it can happen, that
> >> vhost runtime < kworker runtime
> >> which now causes the kworker to get the negative lag.
> >>
> >> In this case it seems like, that the vhost is very fast in waking up
> >> the kworker. And coincidentally, the kworker takes, more time than
> >> usual
> >> to finish. We speak of 4-digit to low 5-digit nanoseconds.
> >>
> >> So, for these outliers, the scheduler extrapolates that the kworker
> >> out-consumes the vhost and should be slowed down, although in the
> >> majority
> >> of other cases this does not happen.
> >
> > Thanks for providing the above details Tobias. It does seem like EEVDF
> > is strict
> > about the eligibility checks and making tasks wait when their lags are
> > negative, even
> > if just a little bit as in the case of the kworker.
> >
> > There was a patch to disable the eligibility checks
> > (https://lore.kernel.org/lkml/[email protected]/),
> > which would make EEVDF more like EVDF, though the deadline comparison
> > would
> > probably still favor the vhost task instead of the kworker with the
> > negative lag.
> >
> > I'm not sure if you tried it, but I thought I'd mention it.
>
> Haven't seen that one yet. Unfortunately, it does not help to ignore the
> eligibility.
>
> I'm inclined to rather propose propose a documentation change, which
> describes that tasks should not rely on woken up tasks being scheduled
> immediately.
Where do you see such an assumption? Even before EEVDF, there was
nothing that ensured such behavior. When using CFS (legacy or EEVDF),
you can't know whether the newly woken task will run first or not.
>
> Changing things in the code to address for the specific scenario I'm
> seeing seems to mostly create unwanted side effects and/or would require
> the definition of some magic cut-off values.
>
>
On Tue, Mar 19, 2024 at 02:41:14PM +0100, Vincent Guittot wrote:
> On Tue, 19 Mar 2024 at 10:08, Tobias Huschle <[email protected]> wrote:
> >
> > On 2024-03-18 15:45, Luis Machado wrote:
> > > On 3/14/24 13:45, Tobias Huschle wrote:
> > >> On Fri, Mar 08, 2024 at 03:11:38PM +0000, Luis Machado wrote:
> > >>> On 2/28/24 16:10, Tobias Huschle wrote:
> > >>>>
> > >>>> Questions:
> > >>>> 1. The kworker getting its negative lag occurs in the following
> > >>>> scenario
> > >>>> - kworker and a cgroup are supposed to execute on the same CPU
> > >>>> - one task within the cgroup is executing and wakes up the
> > >>>> kworker
> > >>>> - kworker with 0 lag, gets picked immediately and finishes its
> > >>>> execution within ~5000ns
> > >>>> - on dequeue, kworker gets assigned a negative lag
> > >>>> Is this expected behavior? With this short execution time, I
> > >>>> would
> > >>>> expect the kworker to be fine.
> > >>>
> > >>> That strikes me as a bit odd as well. Have you been able to determine
> > >>> how a negative lag
> > >>> is assigned to the kworker after such a short runtime?
> > >>>
> > >>
> > >> I did some more trace reading though and found something.
> > >>
> > >> What I observed if everything runs regularly:
> > >> - vhost and kworker run alternating on the same CPU
> > >> - if the kworker is done, it leaves the runqueue
> > >> - vhost wakes up the kworker if it needs it
> > >> --> this means:
> > >> - vhost starts alone on an otherwise empty runqueue
> > >> - it seems like it never gets dequeued
> > >> (unless another unrelated task joins or migration hits)
> > >> - if vhost wakes up the kworker, the kworker gets selected
> > >> - vhost runtime > kworker runtime
> > >> --> kworker gets positive lag and gets selected immediately next
> > >> time
> > >>
> > >> What happens if it does go wrong:
> > >> From what I gather, there seem to be occasions where the vhost either
> > >> executes suprisingly quick, or the kworker surprinsingly slow. If
> > >> these
> > >> outliers reach critical values, it can happen, that
> > >> vhost runtime < kworker runtime
> > >> which now causes the kworker to get the negative lag.
> > >>
> > >> In this case it seems like, that the vhost is very fast in waking up
> > >> the kworker. And coincidentally, the kworker takes, more time than
> > >> usual
> > >> to finish. We speak of 4-digit to low 5-digit nanoseconds.
> > >>
> > >> So, for these outliers, the scheduler extrapolates that the kworker
> > >> out-consumes the vhost and should be slowed down, although in the
> > >> majority
> > >> of other cases this does not happen.
> > >
> > > Thanks for providing the above details Tobias. It does seem like EEVDF
> > > is strict
> > > about the eligibility checks and making tasks wait when their lags are
> > > negative, even
> > > if just a little bit as in the case of the kworker.
> > >
> > > There was a patch to disable the eligibility checks
> > > (https://lore.kernel.org/lkml/[email protected]/),
> > > which would make EEVDF more like EVDF, though the deadline comparison
> > > would
> > > probably still favor the vhost task instead of the kworker with the
> > > negative lag.
> > >
> > > I'm not sure if you tried it, but I thought I'd mention it.
> >
> > Haven't seen that one yet. Unfortunately, it does not help to ignore the
> > eligibility.
> >
> > I'm inclined to rather propose propose a documentation change, which
> > describes that tasks should not rely on woken up tasks being scheduled
> > immediately.
>
> Where do you see such an assumption ? Even before eevdf, there were
> nothing that ensures such behavior. When using CFS (legacy or eevdf)
> tasks, you can't know if the newly wakeup task will run 1st or not
>
There was no guarantee of course. place_entity was reducing the vruntime of
woken-up tasks though, giving them a slight boost, right? For the scenario
that I observed, that boost was enough to make sure that the woken-up task
gets scheduled consistently. This might still not be true for all scenarios,
but in general EEVDF seems to be stricter with woken-up tasks.
Dismissing the lag on wakeup obviously also does not guarantee getting
scheduled, as other tasks might still be involved.
The question would be whether it should be explicitly mentioned somewhere
that, at this point, woken-up tasks are not getting any special treatment
and no one should rely on that boost for woken-up tasks.
> >
> > Changing things in the code to address for the specific scenario I'm
> > seeing seems to mostly create unwanted side effects and/or would require
> > the definition of some magic cut-off values.
> >
> >
On 3/20/24 07:04, Tobias Huschle wrote:
> On Tue, Mar 19, 2024 at 02:41:14PM +0100, Vincent Guittot wrote:
>> On Tue, 19 Mar 2024 at 10:08, Tobias Huschle <[email protected]> wrote:
>>>
>>> On 2024-03-18 15:45, Luis Machado wrote:
>>>> On 3/14/24 13:45, Tobias Huschle wrote:
>>>>> On Fri, Mar 08, 2024 at 03:11:38PM +0000, Luis Machado wrote:
>>>>>> On 2/28/24 16:10, Tobias Huschle wrote:
>>>>>>>
>>>>>>> Questions:
>>>>>>> 1. The kworker getting its negative lag occurs in the following
>>>>>>> scenario
>>>>>>> - kworker and a cgroup are supposed to execute on the same CPU
>>>>>>> - one task within the cgroup is executing and wakes up the
>>>>>>> kworker
>>>>>>> - kworker with 0 lag, gets picked immediately and finishes its
>>>>>>> execution within ~5000ns
>>>>>>> - on dequeue, kworker gets assigned a negative lag
>>>>>>> Is this expected behavior? With this short execution time, I
>>>>>>> would
>>>>>>> expect the kworker to be fine.
>>>>>>
>>>>>> That strikes me as a bit odd as well. Have you been able to determine
>>>>>> how a negative lag
>>>>>> is assigned to the kworker after such a short runtime?
>>>>>>
>>>>>
>>>>> I did some more trace reading though and found something.
>>>>>
>>>>> What I observed if everything runs regularly:
>>>>> - vhost and kworker run alternating on the same CPU
>>>>> - if the kworker is done, it leaves the runqueue
>>>>> - vhost wakes up the kworker if it needs it
>>>>> --> this means:
>>>>> - vhost starts alone on an otherwise empty runqueue
>>>>> - it seems like it never gets dequeued
>>>>> (unless another unrelated task joins or migration hits)
>>>>> - if vhost wakes up the kworker, the kworker gets selected
>>>>> - vhost runtime > kworker runtime
>>>>> --> kworker gets positive lag and gets selected immediately next
>>>>> time
>>>>>
>>>>> What happens if it does go wrong:
>>>>> From what I gather, there seem to be occasions where the vhost either
>>>>> executes suprisingly quick, or the kworker surprinsingly slow. If
>>>>> these
>>>>> outliers reach critical values, it can happen, that
>>>>> vhost runtime < kworker runtime
>>>>> which now causes the kworker to get the negative lag.
>>>>>
>>>>> In this case it seems like, that the vhost is very fast in waking up
>>>>> the kworker. And coincidentally, the kworker takes, more time than
>>>>> usual
>>>>> to finish. We speak of 4-digit to low 5-digit nanoseconds.
>>>>>
>>>>> So, for these outliers, the scheduler extrapolates that the kworker
>>>>> out-consumes the vhost and should be slowed down, although in the
>>>>> majority
>>>>> of other cases this does not happen.
>>>>
>>>> Thanks for providing the above details Tobias. It does seem like EEVDF
>>>> is strict
>>>> about the eligibility checks and making tasks wait when their lags are
>>>> negative, even
>>>> if just a little bit as in the case of the kworker.
>>>>
>>>> There was a patch to disable the eligibility checks
>>>> (https://lore.kernel.org/lkml/[email protected]/),
>>>> which would make EEVDF more like EVDF, though the deadline comparison
>>>> would
>>>> probably still favor the vhost task instead of the kworker with the
>>>> negative lag.
>>>>
>>>> I'm not sure if you tried it, but I thought I'd mention it.
>>>
>>> Haven't seen that one yet. Unfortunately, it does not help to ignore the
>>> eligibility.
>>>
>>> I'm inclined to rather propose propose a documentation change, which
>>> describes that tasks should not rely on woken up tasks being scheduled
>>> immediately.
>>
>> Where do you see such an assumption ? Even before eevdf, there were
>> nothing that ensures such behavior. When using CFS (legacy or eevdf)
>> tasks, you can't know if the newly wakeup task will run 1st or not
>>
>
> There was no guarantee of course. place_entity was reducing the vruntime of
> woken up tasks though, giving it a slight boost, right?. For the scenario
> that I observed, that boost was enough to make sure, that the woken up tasks
> gets scheduled consistently. This might still not be true for all scenarios,
> but in general EEVDF seems to be stricter with woken up tasks.
It seems that way, as EEVDF will do eligibility and deadline checks before scheduling a task, so
a task would have to satisfy both of those checks.
I think we have some special treatment for when a task initially joins the competition, in which
case we halve its slice. But I don't think there is any special treatment for woken tasks
anymore.
There was also a fix (63304558ba5dcaaff9e052ee43cfdcc7f9c29e85) to try to reduce the number of
wake up preemptions under some conditions, under the RUN_TO_PARITY feature.
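Schematically, the eligibility and deadline checks combine like this toy
model (not the actual kernel code, just the idea):

#include <stdio.h>

struct toy_se {
	const char *name;
	long long vruntime;
	long long deadline;
};

/* eligible means the entity has not run ahead of the queue's average */
static int eligible(long long avg_vruntime, const struct toy_se *se)
{
	return avg_vruntime - se->vruntime >= 0;	/* negative lag => wait */
}

/* among eligible entities, the earliest virtual deadline wins */
static const struct toy_se *pick(long long avg, const struct toy_se *se, int n)
{
	const struct toy_se *best = NULL;

	for (int i = 0; i < n; i++) {
		if (!eligible(avg, &se[i]))
			continue;
		if (!best || se[i].deadline < best->deadline)
			best = &se[i];
	}
	return best;
}

int main(void)
{
	struct toy_se rq[] = {
		{ "vhost",   100, 160 },
		{ "kworker", 105, 120 },	/* earlier deadline, but lag < 0 */
	};

	/* pretend 102 is the weighted average vruntime of the queue */
	printf("picked: %s\n", pick(102, rq, 2)->name);	/* -> vhost */
	return 0;
}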
On Wed, 20 Mar 2024 at 08:04, Tobias Huschle <[email protected]> wrote:
>
> On Tue, Mar 19, 2024 at 02:41:14PM +0100, Vincent Guittot wrote:
> > On Tue, 19 Mar 2024 at 10:08, Tobias Huschle <[email protected]> wrote:
> > >
..
> > >
> > > Haven't seen that one yet. Unfortunately, it does not help to ignore the
> > > eligibility.
> > >
> > > I'm inclined to rather propose propose a documentation change, which
> > > describes that tasks should not rely on woken up tasks being scheduled
> > > immediately.
> >
> > Where do you see such an assumption ? Even before eevdf, there were
> > nothing that ensures such behavior. When using CFS (legacy or eevdf)
> > tasks, you can't know if the newly wakeup task will run 1st or not
> >
>
> There was no guarantee of course. place_entity was reducing the vruntime of
> woken up tasks though, giving it a slight boost, right?. For the scenario
It was rather the opposite: it was ensuring that long-sleeping tasks
do not get too much of a bonus because of a vruntime too far in the past.
This is similar, although not exactly the same in intent, to the lag. The
bonus was up to 24ms previously, whereas it is no more than a slice now.
> that I observed, that boost was enough to make sure, that the woken up tasks
> gets scheduled consistently. This might still not be true for all scenarios,
> but in general EEVDF seems to be stricter with woken up tasks.
>
> Dismissing the lag on wakeup also does obviously not guarantee getting
> scheduled, as other tasks might still be involved.
>
> The question would be if it should be explicitly mentioned somewhere that,
> at this point, woken up tasks are not getting any special treatment and
> noone should rely on that boost for woken up tasks.
>
> > >
> > > Changing things in the code to address for the specific scenario I'm
> > > seeing seems to mostly create unwanted side effects and/or would require
> > > the definition of some magic cut-off values.
> > >
> > >
On Wed, Mar 20, 2024 at 02:51:00PM +0100, Vincent Guittot wrote:
> On Wed, 20 Mar 2024 at 08:04, Tobias Huschle <[email protected]> wrote:
> > There was no guarantee of course. place_entity was reducing the vruntime of
> > woken up tasks though, giving it a slight boost, right?. For the scenario
>
> It was rather the opposite, It was ensuring that long sleeping tasks
> will not get too much bonus because of vruntime too far in the past.
> This is similar although not exactly the same intent as the lag. The
> bonus was up to 24ms previously whereas it's not more than a slice now
>
I might have gotten this quite wrong then. I was looking at place_entity
and saw that non-initial placements get their vruntime reduced via
vruntime -= thresh;
which would mean that the placed task would have a vruntime smaller than
cfs_rq->min_vruntime, based on pre-EEVDF behavior, last seen at:
af4cf40470c2 sched/fair: Add cfs_rq::avg_vruntime
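Condensed, the path I was looking at (pre-EEVDF fair.c, quoted loosely
from memory, details vary by kernel version):

static void place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
{
	u64 vruntime = cfs_rq->min_vruntime;

	if (!initial) {
		unsigned long thresh = sysctl_sched_latency;

		if (sched_feat(GENTLE_FAIR_SLEEPERS))
			thresh >>= 1;

		/* this is the reduction I read as a wakeup bonus */
		vruntime -= thresh;
	}
	/* ... */
}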
If there was no such benefit for woken-up tasks, then the scenario I observed
is just coincidentally worse with EEVDF, which can happen when exchanging an
algorithm, I suppose. Or EEVDF just exposes a so far hidden problem in that
scenario.
On Thu, 21 Mar 2024 at 13:18, Tobias Huschle <[email protected]> wrote:
>
> On Wed, Mar 20, 2024 at 02:51:00PM +0100, Vincent Guittot wrote:
> > On Wed, 20 Mar 2024 at 08:04, Tobias Huschle <[email protected]> wrote:
> > > There was no guarantee of course. place_entity was reducing the vruntime of
> > > woken up tasks though, giving it a slight boost, right?. For the scenario
> >
> > It was rather the opposite, It was ensuring that long sleeping tasks
> > will not get too much bonus because of vruntime too far in the past.
> > This is similar although not exactly the same intent as the lag. The
> > bonus was up to 24ms previously whereas it's not more than a slice now
> >
>
> I might have gotten this quite wrong then. I was looking at place_entity
> and saw that non-initial placements get their vruntime reduced via
>
> vruntime -= thresh;
and then
se->vruntime = max_vruntime(se->vruntime, vruntime)
>
> which would mean that the placed task would have a vruntime smaller than
> cfs_rq->min_vruntime, based on pre-EEVDF behavior, last seen at:
>
> af4cf40470c2 sched/fair: Add cfs_rq::avg_vruntime
>
> If there was no such benefit for woken up tasks. Then the scenario I observed
> is just conincidentally worse with EEVDF, which can happen when exchanging an
> algorithm I suppose. Or EEVDF just exposes a so far hidden problem in that
> scenario.
On Fri, Mar 22, 2024 at 06:02:05PM +0100, Vincent Guittot wrote:
> and then
> se->vruntime = max_vruntime(se->vruntime, vruntime)
>
First things first, I was wrong to assume a "boost" in the CFS code. So I
dug a bit deeper and tried to pinpoint what the difference between CFS and
EEVDF actually is. I found the following:
Let's assume we have two tasks taking turns on a single CPU.
Task 1 is always runnable.
Task 2 gets woken up by task 1 and goes back to sleep when it is done.
This means, task 1 runs, wakes up task 2, task 2 runs, goes to sleep and
task 1 runs again and we repeat.
Most of the time: runtime(task1) > runtime(task2)
Rare occasions: runtime(task1) < runtime(task2)
So, task 1 usually consumes more of its designated time slice until it gets
rescheduled by the wakeup of task 2 than task 2 does. But neither ever
consumes its full time slice. Rather the opposite: both run for low 5-digit
ns or less.
So something like this:
task 1 |----------| |---------| |------...
task 2 |----| |----|
This creates different behaviors under CFS and EEVDF:
### CFS ####################################
In CFS, the difference in runtimes means that task 2 cannot catch up with
task 1 in terms of vruntime.
With every exchange between task 1 and task 2, task 2 falls further behind
in vruntime. Once a difference in the magnitude of sysctl_sched_latency is
established, the difference remains stable due to the max handling in
place_entity.
Occasionally, task 2 may run longer than task 1. In those cases, it
will catch up slightly. But in the majority of cases, task 2 runs
shorter, thereby increasing the difference in vruntime.
This would explain why task 2 always gets scheduled immediately on wakeup.
### EEVDF ##################################
The rare occasions where task 2 runs longer than task 1 seem to cause
issues with EEVDF:
In the regular case, where task 1 runs longer than task 2, task 2 gets
a positive lag and is selected on wakeup --> good.
In the irregular case, where task 2 runs longer than task 1, task 2 now
gets a negative lag and is no longer chosen on wakeup --> bad (in some cases).
This would explain why task 2 occasionally does not get selected on wakeup.
### Summary ################################
So my wording that a woken-up task gets "boosted" was obviously wrong.
Task 2 is not getting boosted in CFS; it gets "outrun" by task 1, with
no chance of catching up, leaving it with a smaller vruntime value.
EEVDF, on the other hand, does not allow lag to accumulate if an entity,
like task 2 in this case, regularly dequeues itself. So it will always have
a lag bounded by whatever runtime difference it encountered in its last
exchanges with task 1.
The patch below allows tasks to accumulate lag over time. This fixes the
original regression that made me stumble into this topic. But this might
of course come with arbitrary side effects.
I'm not suggesting to actually implement this, but would like to confirm
whether my understanding is correct that this is the aspect where CFS and
EEVDF differ: CFS is more aware of the past in this particular case
than EEVDF is.
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 03be0d1330a6..b83a72311d2a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -701,7 +701,7 @@ static void update_entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se)
s64 lag, limit;
SCHED_WARN_ON(!se->on_rq);
- lag = avg_vruntime(cfs_rq) - se->vruntime;
+ lag = se->vlag + avg_vruntime(cfs_rq) - se->vruntime;
limit = calc_delta_fair(max_t(u64, 2*se->slice, TICK_NSEC), se);
se->vlag = clamp(lag, -limit, limit);