Commit 829c1651e9c4 ("sched/fair: sanitize vruntime of entity being placed")
fixes an overflowing bug, but ignore a case that se->exec_start is reset
after a migration.
For fixing this case, we delay the reset of se->exec_start after
placing the entity which se->exec_start to detect long sleeping task.
In order to take into account a possible divergence between the clock_task
of 2 rqs, we increase the threshold to around 104 days.
Fixes: 829c1651e9c4 ("sched/fair: sanitize vruntime of entity being placed")
Signed-off-by: Zhang Qiao <[email protected]>
Signed-off-by: Vincent Guittot <[email protected]>
---
My last proposal was not yet correct as the exec_start was not always
reset after migrating a task. I finally found this solution which keeps
the long sleep detection to one place as well as the reset of se->exec_start.
kernel/sched/fair.c | 57 +++++++++++++++++++++++++++++++++++----------
1 file changed, 45 insertions(+), 12 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0f499e9a74b5..421173fde351 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4648,11 +4648,33 @@ static void check_spread(struct cfs_rq *cfs_rq, struct sched_entity *se)
#endif
}
+static inline bool entity_is_long_sleeper(struct sched_entity *se)
+{
+ struct cfs_rq *cfs_rq;
+ u64 sleep_time;
+
+ if (se->exec_start == 0)
+ return false;
+
+ cfs_rq = cfs_rq_of(se);
+
+ sleep_time = rq_clock_task(rq_of(cfs_rq));
+
+ /* Happen while migrating because of clock task divergence */
+ if (sleep_time <= se->exec_start)
+ return false;
+
+ sleep_time -= se->exec_start;
+ if (sleep_time > ((1ULL << 63) / scale_load_down(NICE_0_LOAD)))
+ return true;
+
+ return false;
+}
+
static void
place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
{
u64 vruntime = cfs_rq->min_vruntime;
- u64 sleep_time;
/*
* The 'current' period is already promised to the current tasks,
@@ -4684,13 +4706,24 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
/*
* Pull vruntime of the entity being placed to the base level of
- * cfs_rq, to prevent boosting it if placed backwards. If the entity
- * slept for a long time, don't even try to compare its vruntime with
- * the base as it may be too far off and the comparison may get
- * inversed due to s64 overflow.
- */
- sleep_time = rq_clock_task(rq_of(cfs_rq)) - se->exec_start;
- if ((s64)sleep_time > 60LL * NSEC_PER_SEC)
+ * cfs_rq, to prevent boosting it if placed backwards.
+ * However, min_vruntime can advance much faster than real time, with
+ * the extreme being when an entity with the minimal weight always runs
+ * on the cfs_rq. If the waking entity slept for a long time, its
+ * vruntime difference from min_vruntime may overflow s64 and their
+ * comparison may get inversed, so ignore the entity's original
+ * vruntime in that case.
+ * The maximal vruntime speedup is given by the ratio of normal to
+ * minimal weight: scale_load_down(NICE_0_LOAD) / MIN_SHARES.
+ * When placing a migrated waking entity, its exec_start has been set
+ * from a different rq. In order to take into account a possible
+ * divergence between new and prev rq's clocks task because of irq and
+ * stolen time, we take an additional margin.
+ * So, cutting off on the sleep time of
+ * 2^63 / scale_load_down(NICE_0_LOAD) ~ 104 days
+ * should be safe.
+ */
+ if (entity_is_long_sleeper(se))
se->vruntime = vruntime;
else
se->vruntime = max_vruntime(se->vruntime, vruntime);
@@ -4770,6 +4803,9 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
if (flags & ENQUEUE_WAKEUP)
place_entity(cfs_rq, se, 0);
+ /* Entity has migrated, no longer consider this task hot */
+ if (flags & ENQUEUE_MIGRATED)
+ se->exec_start = 0;
check_schedstat_required();
update_stats_enqueue_fair(cfs_rq, se, flags);
@@ -7665,9 +7701,6 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
/* Tell new CPU we are migrated */
se->avg.last_update_time = 0;
- /* We have migrated, no longer consider this task hot */
- se->exec_start = 0;
-
update_scan_period(p, new_cpu);
}
@@ -8701,7 +8734,7 @@ static void attach_task(struct rq *rq, struct task_struct *p)
lockdep_assert_rq_held(rq);
WARN_ON_ONCE(task_rq(p) != rq);
- activate_task(rq, p, ENQUEUE_NOCLOCK);
+ activate_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_MIGRATED);
check_preempt_curr(rq, p, 0);
}
--
2.34.1
在 2023/3/18 0:08, Vincent Guittot 写道:
> Commit 829c1651e9c4 ("sched/fair: sanitize vruntime of entity being placed")
> fixes an overflowing bug, but ignore a case that se->exec_start is reset
> after a migration.
>
> For fixing this case, we delay the reset of se->exec_start after
> placing the entity which se->exec_start to detect long sleeping task.
>
> In order to take into account a possible divergence between the clock_task
> of 2 rqs, we increase the threshold to around 104 days.
>
>
> Fixes: 829c1651e9c4 ("sched/fair: sanitize vruntime of entity being placed")
> Signed-off-by: Zhang Qiao <[email protected]>
> Signed-off-by: Vincent Guittot <[email protected]>
> ---
>
> My last proposal was not yet correct as the exec_start was not always
> reset after migrating a task. I finally found this solution which keeps
> the long sleep detection to one place as well as the reset of se->exec_start.
>
Tested-by: Zhang Qiao <[email protected]>
I have retested it with this version, and the result is fine.
-------
Performance counter stats for 'hackbench -g 44 -f 20 --process --pipe -l 60000 -s 100' (10 runs):
80.10 +- 1.22 seconds time elapsed ( +- 1.53% )
Thanks,
Zhang Qiao.
> kernel/sched/fair.c | 57 +++++++++++++++++++++++++++++++++++----------
> 1 file changed, 45 insertions(+), 12 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 0f499e9a74b5..421173fde351 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4648,11 +4648,33 @@ static void check_spread(struct cfs_rq *cfs_rq, struct sched_entity *se)
> #endif
> }
>
> +static inline bool entity_is_long_sleeper(struct sched_entity *se)
> +{
> + struct cfs_rq *cfs_rq;
> + u64 sleep_time;
> +
> + if (se->exec_start == 0)
> + return false;
> +
> + cfs_rq = cfs_rq_of(se);
> +
> + sleep_time = rq_clock_task(rq_of(cfs_rq));
> +
> + /* Happen while migrating because of clock task divergence */
> + if (sleep_time <= se->exec_start)
> + return false;
> +
> + sleep_time -= se->exec_start;
> + if (sleep_time > ((1ULL << 63) / scale_load_down(NICE_0_LOAD)))
> + return true;
> +
> + return false;
> +}
> +
> static void
> place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
> {
> u64 vruntime = cfs_rq->min_vruntime;
> - u64 sleep_time;
>
> /*
> * The 'current' period is already promised to the current tasks,
> @@ -4684,13 +4706,24 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
>
> /*
> * Pull vruntime of the entity being placed to the base level of
> - * cfs_rq, to prevent boosting it if placed backwards. If the entity
> - * slept for a long time, don't even try to compare its vruntime with
> - * the base as it may be too far off and the comparison may get
> - * inversed due to s64 overflow.
> - */
> - sleep_time = rq_clock_task(rq_of(cfs_rq)) - se->exec_start;
> - if ((s64)sleep_time > 60LL * NSEC_PER_SEC)
> + * cfs_rq, to prevent boosting it if placed backwards.
> + * However, min_vruntime can advance much faster than real time, with
> + * the extreme being when an entity with the minimal weight always runs
> + * on the cfs_rq. If the waking entity slept for a long time, its
> + * vruntime difference from min_vruntime may overflow s64 and their
> + * comparison may get inversed, so ignore the entity's original
> + * vruntime in that case.
> + * The maximal vruntime speedup is given by the ratio of normal to
> + * minimal weight: scale_load_down(NICE_0_LOAD) / MIN_SHARES.
> + * When placing a migrated waking entity, its exec_start has been set
> + * from a different rq. In order to take into account a possible
> + * divergence between new and prev rq's clocks task because of irq and
> + * stolen time, we take an additional margin.
> + * So, cutting off on the sleep time of
> + * 2^63 / scale_load_down(NICE_0_LOAD) ~ 104 days
> + * should be safe.
> + */
> + if (entity_is_long_sleeper(se))
> se->vruntime = vruntime;
> else
> se->vruntime = max_vruntime(se->vruntime, vruntime);
> @@ -4770,6 +4803,9 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>
> if (flags & ENQUEUE_WAKEUP)
> place_entity(cfs_rq, se, 0);
> + /* Entity has migrated, no longer consider this task hot */
> + if (flags & ENQUEUE_MIGRATED)
> + se->exec_start = 0;
>
> check_schedstat_required();
> update_stats_enqueue_fair(cfs_rq, se, flags);
> @@ -7665,9 +7701,6 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
> /* Tell new CPU we are migrated */
> se->avg.last_update_time = 0;
>
> - /* We have migrated, no longer consider this task hot */
> - se->exec_start = 0;
> -
> update_scan_period(p, new_cpu);
> }
>
> @@ -8701,7 +8734,7 @@ static void attach_task(struct rq *rq, struct task_struct *p)
> lockdep_assert_rq_held(rq);
>
> WARN_ON_ONCE(task_rq(p) != rq);
> - activate_task(rq, p, ENQUEUE_NOCLOCK);
> + activate_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_MIGRATED);
> check_preempt_curr(rq, p, 0);
> }
>
>
On 18/03/2023 08:45, Zhang Qiao wrote:
>
>
> 在 2023/3/18 0:08, Vincent Guittot 写道:
>> Commit 829c1651e9c4 ("sched/fair: sanitize vruntime of entity being placed")
>> fixes an overflowing bug, but ignore a case that se->exec_start is reset
>> after a migration.
>>
>> For fixing this case, we delay the reset of se->exec_start after
>> placing the entity which se->exec_start to detect long sleeping task.
>>
>> In order to take into account a possible divergence between the clock_task
>> of 2 rqs, we increase the threshold to around 104 days.
>>
>>
>> Fixes: 829c1651e9c4 ("sched/fair: sanitize vruntime of entity being placed")
>> Signed-off-by: Zhang Qiao <[email protected]>
>> Signed-off-by: Vincent Guittot <[email protected]>
>> ---
>>
>> My last proposal was not yet correct as the exec_start was not always
>> reset after migrating a task. I finally found this solution which keeps
>> the long sleep detection to one place as well as the reset of se->exec_start.
>>
>
> Tested-by: Zhang Qiao <[email protected]>
>
> I have retested it with this version, and the result is fine.
>
> -------
>
> Performance counter stats for 'hackbench -g 44 -f 20 --process --pipe -l 60000 -s 100' (10 runs):
>
> 80.10 +- 1.22 seconds time elapsed ( +- 1.53% )
[...]
>> @@ -8701,7 +8734,7 @@ static void attach_task(struct rq *rq, struct task_struct *p)
>> lockdep_assert_rq_held(rq);
>>
>> WARN_ON_ONCE(task_rq(p) != rq);
>> - activate_task(rq, p, ENQUEUE_NOCLOCK);
>> + activate_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_MIGRATED);
>> check_preempt_curr(rq, p, 0);
Why not:
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b9bc1ab67aaa..96dd3a62e683 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7674,7 +7674,10 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
se->vruntime -= u64_u32_load(cfs_rq->min_vruntime);
}
- if (!task_on_rq_migrating(p)) {
+ if (task_on_rq_migrating(p)) {
+ /* We have migrated, no longer consider this task hot */
+ se->exec_start = 0;
+ } else {
remove_entity_load_avg(se);
/*
@@ -8726,7 +8729,7 @@ static void attach_task(struct rq *rq, struct task_struct *p)
lockdep_assert_rq_held(rq);
WARN_ON_ONCE(task_rq(p) != rq);
- activate_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_MIGRATED);
+ activate_task(rq, p, ENQUEUE_NOCLOCK);
check_preempt_curr(rq, p, 0);
}
entity_is_long_sleeper() will bail early for these rq-migrating tasks
for which a long-sleep test would make little sense anyway.
Plus move_queued_task() (e.g. from sched_exex()) would be covered as well.
On Mon, 20 Mar 2023 at 13:29, Dietmar Eggemann <[email protected]> wrote:
>
> On 18/03/2023 08:45, Zhang Qiao wrote:
> >
> >
> > 在 2023/3/18 0:08, Vincent Guittot 写道:
> >> Commit 829c1651e9c4 ("sched/fair: sanitize vruntime of entity being placed")
> >> fixes an overflowing bug, but ignore a case that se->exec_start is reset
> >> after a migration.
> >>
> >> For fixing this case, we delay the reset of se->exec_start after
> >> placing the entity which se->exec_start to detect long sleeping task.
> >>
> >> In order to take into account a possible divergence between the clock_task
> >> of 2 rqs, we increase the threshold to around 104 days.
> >>
> >>
> >> Fixes: 829c1651e9c4 ("sched/fair: sanitize vruntime of entity being placed")
> >> Signed-off-by: Zhang Qiao <[email protected]>
> >> Signed-off-by: Vincent Guittot <[email protected]>
> >> ---
> >>
> >> My last proposal was not yet correct as the exec_start was not always
> >> reset after migrating a task. I finally found this solution which keeps
> >> the long sleep detection to one place as well as the reset of se->exec_start.
> >>
> >
> > Tested-by: Zhang Qiao <[email protected]>
> >
> > I have retested it with this version, and the result is fine.
> >
> > -------
> >
> > Performance counter stats for 'hackbench -g 44 -f 20 --process --pipe -l 60000 -s 100' (10 runs):
> >
> > 80.10 +- 1.22 seconds time elapsed ( +- 1.53% )
>
> [...]
>
> >> @@ -8701,7 +8734,7 @@ static void attach_task(struct rq *rq, struct task_struct *p)
> >> lockdep_assert_rq_held(rq);
> >>
> >> WARN_ON_ONCE(task_rq(p) != rq);
> >> - activate_task(rq, p, ENQUEUE_NOCLOCK);
> >> + activate_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_MIGRATED);
> >> check_preempt_curr(rq, p, 0);
>
> Why not:
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index b9bc1ab67aaa..96dd3a62e683 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7674,7 +7674,10 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
> se->vruntime -= u64_u32_load(cfs_rq->min_vruntime);
> }
>
> - if (!task_on_rq_migrating(p)) {
> + if (task_on_rq_migrating(p)) {
> + /* We have migrated, no longer consider this task hot */
> + se->exec_start = 0;
mainly to keep the clear of se->exec_start = 0 in one place to ease
the maintenance
> + } else {
> remove_entity_load_avg(se);
>
> /*
> @@ -8726,7 +8729,7 @@ static void attach_task(struct rq *rq, struct task_struct *p)
> lockdep_assert_rq_held(rq);
>
> WARN_ON_ONCE(task_rq(p) != rq);
> - activate_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_MIGRATED);
> + activate_task(rq, p, ENQUEUE_NOCLOCK);
> check_preempt_curr(rq, p, 0);
> }
>
>
> entity_is_long_sleeper() will bail early for these rq-migrating tasks
> for which a long-sleep test would make little sense anyway.
>
> Plus move_queued_task() (e.g. from sched_exex()) would be covered as well.
On Fri, Mar 17, 2023 at 05:08:10PM +0100, Vincent Guittot wrote:
> Commit 829c1651e9c4 ("sched/fair: sanitize vruntime of entity being placed")
> fixes an overflowing bug, but ignore a case that se->exec_start is reset
> after a migration.
>
> For fixing this case, we delay the reset of se->exec_start after
> placing the entity which se->exec_start to detect long sleeping task.
>
> In order to take into account a possible divergence between the clock_task
> of 2 rqs, we increase the threshold to around 104 days.
>
>
> Fixes: 829c1651e9c4 ("sched/fair: sanitize vruntime of entity being placed")
> Signed-off-by: Zhang Qiao <[email protected]>
> Signed-off-by: Vincent Guittot <[email protected]>
> ---
Blergh, this just isn't going to be nice. I'll go queue this for
sched/urgent and then we can forget about this for a little while.
Thanks!
On 21/03/2023 11:02, Peter Zijlstra wrote:
> On Fri, Mar 17, 2023 at 05:08:10PM +0100, Vincent Guittot wrote:
>> Commit 829c1651e9c4 ("sched/fair: sanitize vruntime of entity being placed")
>> fixes an overflowing bug, but ignore a case that se->exec_start is reset
>> after a migration.
>>
>> For fixing this case, we delay the reset of se->exec_start after
>> placing the entity which se->exec_start to detect long sleeping task.
>>
>> In order to take into account a possible divergence between the clock_task
>> of 2 rqs, we increase the threshold to around 104 days.
>>
>>
>> Fixes: 829c1651e9c4 ("sched/fair: sanitize vruntime of entity being placed")
>> Signed-off-by: Zhang Qiao <[email protected]>
>> Signed-off-by: Vincent Guittot <[email protected]>
>> ---
>
> Blergh, this just isn't going to be nice. I'll go queue this for
> sched/urgent and then we can forget about this for a little while.
>
> Thanks!
Don't we miss setting `se->exec_start = 0` for fair task in
move_queued_task()? ( ... and __migrate_swap_task())
https://lkml.kernel.org/r/[email protected]
On Tue, Mar 21, 2023 at 11:29:13AM +0100, Dietmar Eggemann wrote:
> On 21/03/2023 11:02, Peter Zijlstra wrote:
> > On Fri, Mar 17, 2023 at 05:08:10PM +0100, Vincent Guittot wrote:
> >> Commit 829c1651e9c4 ("sched/fair: sanitize vruntime of entity being placed")
> >> fixes an overflowing bug, but ignore a case that se->exec_start is reset
> >> after a migration.
> >>
> >> For fixing this case, we delay the reset of se->exec_start after
> >> placing the entity which se->exec_start to detect long sleeping task.
> >>
> >> In order to take into account a possible divergence between the clock_task
> >> of 2 rqs, we increase the threshold to around 104 days.
> >>
> >>
> >> Fixes: 829c1651e9c4 ("sched/fair: sanitize vruntime of entity being placed")
> >> Signed-off-by: Zhang Qiao <[email protected]>
> >> Signed-off-by: Vincent Guittot <[email protected]>
> >> ---
> >
> > Blergh, this just isn't going to be nice. I'll go queue this for
> > sched/urgent and then we can forget about this for a little while.
> >
> > Thanks!
>
> Don't we miss setting `se->exec_start = 0` for fair task in
> move_queued_task()? ( ... and __migrate_swap_task())
>
> https://lkml.kernel.org/r/[email protected]
Ah, I see what you mean now... When I read your and Vincent's replies
earlier today I though you mean to avoid the extra ENQUEUE_MIGRATED use,
but your actual goal was to capure more sites.
Hmm, we could of course go add more ENQUEUE_MIGRATED, but you're right
in that TASK_ON_RQ_MIGRATING already captures that.
An alternative is something like the below, that matches
deactivate_task(), but still uses ENQUEUE_MIGRATED to pass it down into
the class methods.
Hmm?
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2084,6 +2084,9 @@ static inline void dequeue_task(struct r
void activate_task(struct rq *rq, struct task_struct *p, int flags)
{
+ if (task_on_rq_migrating(p))
+ flags |= ENQUEUE_MIGRATED;
+
enqueue_task(rq, p, flags);
p->on_rq = TASK_ON_RQ_QUEUED;
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8726,7 +8726,7 @@ static void attach_task(struct rq *rq, s
lockdep_assert_rq_held(rq);
WARN_ON_ONCE(task_rq(p) != rq);
- activate_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_MIGRATED);
+ activate_task(rq, p, ENQUEUE_NOCLOCK);
check_preempt_curr(rq, p, 0);
}
On Tue, 21 Mar 2023 at 11:50, Peter Zijlstra <[email protected]> wrote:
>
> On Tue, Mar 21, 2023 at 11:29:13AM +0100, Dietmar Eggemann wrote:
> > On 21/03/2023 11:02, Peter Zijlstra wrote:
> > > On Fri, Mar 17, 2023 at 05:08:10PM +0100, Vincent Guittot wrote:
> > >> Commit 829c1651e9c4 ("sched/fair: sanitize vruntime of entity being placed")
> > >> fixes an overflowing bug, but ignore a case that se->exec_start is reset
> > >> after a migration.
> > >>
> > >> For fixing this case, we delay the reset of se->exec_start after
> > >> placing the entity which se->exec_start to detect long sleeping task.
> > >>
> > >> In order to take into account a possible divergence between the clock_task
> > >> of 2 rqs, we increase the threshold to around 104 days.
> > >>
> > >>
> > >> Fixes: 829c1651e9c4 ("sched/fair: sanitize vruntime of entity being placed")
> > >> Signed-off-by: Zhang Qiao <[email protected]>
> > >> Signed-off-by: Vincent Guittot <[email protected]>
> > >> ---
> > >
> > > Blergh, this just isn't going to be nice. I'll go queue this for
> > > sched/urgent and then we can forget about this for a little while.
> > >
> > > Thanks!
> >
> > Don't we miss setting `se->exec_start = 0` for fair task in
> > move_queued_task()? ( ... and __migrate_swap_task())
> >
> > https://lkml.kernel.org/r/[email protected]
>
> Ah, I see what you mean now... When I read your and Vincent's replies
> earlier today I though you mean to avoid the extra ENQUEUE_MIGRATED use,
> but your actual goal was to capure more sites.
>
> Hmm, we could of course go add more ENQUEUE_MIGRATED, but you're right
> in that TASK_ON_RQ_MIGRATING already captures that.
>
> An alternative is something like the below, that matches
> deactivate_task(), but still uses ENQUEUE_MIGRATED to pass it down into
> the class methods.
>
> Hmm?
Yes, this seems to be the right way to set ENQUEUE_MIGRATED flags
>
>
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2084,6 +2084,9 @@ static inline void dequeue_task(struct r
>
> void activate_task(struct rq *rq, struct task_struct *p, int flags)
> {
> + if (task_on_rq_migrating(p))
> + flags |= ENQUEUE_MIGRATED;
> +
> enqueue_task(rq, p, flags);
>
> p->on_rq = TASK_ON_RQ_QUEUED;
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8726,7 +8726,7 @@ static void attach_task(struct rq *rq, s
> lockdep_assert_rq_held(rq);
>
> WARN_ON_ONCE(task_rq(p) != rq);
> - activate_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_MIGRATED);
> + activate_task(rq, p, ENQUEUE_NOCLOCK);
> check_preempt_curr(rq, p, 0);
> }
>
On 21/03/2023 11:49, Peter Zijlstra wrote:
> On Tue, Mar 21, 2023 at 11:29:13AM +0100, Dietmar Eggemann wrote:
>> On 21/03/2023 11:02, Peter Zijlstra wrote:
>>> On Fri, Mar 17, 2023 at 05:08:10PM +0100, Vincent Guittot wrote:
>>>> Commit 829c1651e9c4 ("sched/fair: sanitize vruntime of entity being placed")
>>>> fixes an overflowing bug, but ignore a case that se->exec_start is reset
>>>> after a migration.
>>>>
>>>> For fixing this case, we delay the reset of se->exec_start after
>>>> placing the entity which se->exec_start to detect long sleeping task.
>>>>
>>>> In order to take into account a possible divergence between the clock_task
>>>> of 2 rqs, we increase the threshold to around 104 days.
>>>>
>>>>
>>>> Fixes: 829c1651e9c4 ("sched/fair: sanitize vruntime of entity being placed")
>>>> Signed-off-by: Zhang Qiao <[email protected]>
>>>> Signed-off-by: Vincent Guittot <[email protected]>
>>>> ---
>>>
>>> Blergh, this just isn't going to be nice. I'll go queue this for
>>> sched/urgent and then we can forget about this for a little while.
>>>
>>> Thanks!
>>
>> Don't we miss setting `se->exec_start = 0` for fair task in
>> move_queued_task()? ( ... and __migrate_swap_task())
>>
>> https://lkml.kernel.org/r/[email protected]
>
> Ah, I see what you mean now... When I read your and Vincent's replies
> earlier today I though you mean to avoid the extra ENQUEUE_MIGRATED use,
> but your actual goal was to capure more sites.
>
> Hmm, we could of course go add more ENQUEUE_MIGRATED, but you're right
> in that TASK_ON_RQ_MIGRATING already captures that.
And in case of move_queued_task() this would have to be conditioned on
SCHED_NORMAL.
> An alternative is something like the below, that matches
> deactivate_task(), but still uses ENQUEUE_MIGRATED to pass it down into
> the class methods.
>
> Hmm?
>
>
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2084,6 +2084,9 @@ static inline void dequeue_task(struct r
>
> void activate_task(struct rq *rq, struct task_struct *p, int flags)
> {
> + if (task_on_rq_migrating(p))
> + flags |= ENQUEUE_MIGRATED;
> +
> enqueue_task(rq, p, flags);
>
> p->on_rq = TASK_ON_RQ_QUEUED;
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8726,7 +8726,7 @@ static void attach_task(struct rq *rq, s
> lockdep_assert_rq_held(rq);
>
> WARN_ON_ONCE(task_rq(p) != rq);
> - activate_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_MIGRATED);
> + activate_task(rq, p, ENQUEUE_NOCLOCK);
> check_preempt_curr(rq, p, 0);
> }
Would work too.
IMHO, setting `se->exec_start = 0` for task_on_rq_migrating(p) already
in migrate_task_rq_fair() would have the charm that
entity_is_long_sleeper() would bail out early for these tasks.
On Tue, Mar 21, 2023 at 12:13:06PM +0100, Dietmar Eggemann wrote:
> On 21/03/2023 11:49, Peter Zijlstra wrote:
> > On Tue, Mar 21, 2023 at 11:29:13AM +0100, Dietmar Eggemann wrote:
> >> On 21/03/2023 11:02, Peter Zijlstra wrote:
> >>> On Fri, Mar 17, 2023 at 05:08:10PM +0100, Vincent Guittot wrote:
> >>>> Commit 829c1651e9c4 ("sched/fair: sanitize vruntime of entity being placed")
> >>>> fixes an overflowing bug, but ignore a case that se->exec_start is reset
> >>>> after a migration.
> >>>>
> >>>> For fixing this case, we delay the reset of se->exec_start after
> >>>> placing the entity which se->exec_start to detect long sleeping task.
> >>>>
> >>>> In order to take into account a possible divergence between the clock_task
> >>>> of 2 rqs, we increase the threshold to around 104 days.
> >>>>
> >>>>
> >>>> Fixes: 829c1651e9c4 ("sched/fair: sanitize vruntime of entity being placed")
> >>>> Signed-off-by: Zhang Qiao <[email protected]>
> >>>> Signed-off-by: Vincent Guittot <[email protected]>
> >>>> ---
> >>>
> >>> Blergh, this just isn't going to be nice. I'll go queue this for
> >>> sched/urgent and then we can forget about this for a little while.
> >>>
> >>> Thanks!
> >>
> >> Don't we miss setting `se->exec_start = 0` for fair task in
> >> move_queued_task()? ( ... and __migrate_swap_task())
> >>
> >> https://lkml.kernel.org/r/[email protected]
> >
> > Ah, I see what you mean now... When I read your and Vincent's replies
> > earlier today I though you mean to avoid the extra ENQUEUE_MIGRATED use,
> > but your actual goal was to capure more sites.
> >
> > Hmm, we could of course go add more ENQUEUE_MIGRATED, but you're right
> > in that TASK_ON_RQ_MIGRATING already captures that.
>
> And in case of move_queued_task() this would have to be conditioned on
> SCHED_NORMAL.
I would prefer to not do that -- keep uniform rules. AFAICT the only
other user of ENQUEUE_MIGRATED is deadline and that needs
ENQUEUE_WAKEUP|ENQUEUE_MIGRATED combination to be effective and I don't
think we've added any of those.
> > An alternative is something like the below, that matches
> > deactivate_task(), but still uses ENQUEUE_MIGRATED to pass it down into
> > the class methods.
> >
> > Hmm?
> >
> >
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -2084,6 +2084,9 @@ static inline void dequeue_task(struct r
> >
> > void activate_task(struct rq *rq, struct task_struct *p, int flags)
> > {
> > + if (task_on_rq_migrating(p))
> > + flags |= ENQUEUE_MIGRATED;
> > +
> > enqueue_task(rq, p, flags);
> >
> > p->on_rq = TASK_ON_RQ_QUEUED;
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -8726,7 +8726,7 @@ static void attach_task(struct rq *rq, s
> > lockdep_assert_rq_held(rq);
> >
> > WARN_ON_ONCE(task_rq(p) != rq);
> > - activate_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_MIGRATED);
> > + activate_task(rq, p, ENQUEUE_NOCLOCK);
> > check_preempt_curr(rq, p, 0);
> > }
>
> Would work too.
OK, let me fold this in and then we can always tinker with it later if
we're so motivated :-)
On Fri, Mar 17, 2023 at 05:08:10PM +0100, Vincent Guittot wrote:
> Commit 829c1651e9c4 ("sched/fair: sanitize vruntime of entity being placed")
> fixes an overflowing bug, but ignore a case that se->exec_start is reset
> after a migration.
>
> For fixing this case, we delay the reset of se->exec_start after
> placing the entity which se->exec_start to detect long sleeping task.
>
> In order to take into account a possible divergence between the clock_task
> of 2 rqs, we increase the threshold to around 104 days.
>
>
> Fixes: 829c1651e9c4 ("sched/fair: sanitize vruntime of entity being placed")
> Signed-off-by: Zhang Qiao <[email protected]>
> Signed-off-by: Vincent Guittot <[email protected]>
Just noticed, that SoB chain isn't valid, should Zhang be something
like: Originally-by or Suggested-by or whatever?
On Tue, 21 Mar 2023 at 13:28, Peter Zijlstra <[email protected]> wrote:
>
> On Fri, Mar 17, 2023 at 05:08:10PM +0100, Vincent Guittot wrote:
> > Commit 829c1651e9c4 ("sched/fair: sanitize vruntime of entity being placed")
> > fixes an overflowing bug, but ignore a case that se->exec_start is reset
> > after a migration.
> >
> > For fixing this case, we delay the reset of se->exec_start after
> > placing the entity which se->exec_start to detect long sleeping task.
> >
> > In order to take into account a possible divergence between the clock_task
> > of 2 rqs, we increase the threshold to around 104 days.
> >
> >
> > Fixes: 829c1651e9c4 ("sched/fair: sanitize vruntime of entity being placed")
> > Signed-off-by: Zhang Qiao <[email protected]>
> > Signed-off-by: Vincent Guittot <[email protected]>
>
> Just noticed, that SoB chain isn't valid, should Zhang be something
> like: Originally-by or Suggested-by or whatever?
Originally-by would be the right tag
The following commit has been merged into the sched/urgent branch of tip:
Commit-ID: a53ce18cacb477dd0513c607f187d16f0fa96f71
Gitweb: https://git.kernel.org/tip/a53ce18cacb477dd0513c607f187d16f0fa96f71
Author: Vincent Guittot <[email protected]>
AuthorDate: Fri, 17 Mar 2023 17:08:10 +01:00
Committer: Peter Zijlstra <[email protected]>
CommitterDate: Tue, 21 Mar 2023 14:43:04 +01:00
sched/fair: Sanitize vruntime of entity being migrated
Commit 829c1651e9c4 ("sched/fair: sanitize vruntime of entity being placed")
fixes an overflowing bug, but ignore a case that se->exec_start is reset
after a migration.
For fixing this case, we delay the reset of se->exec_start after
placing the entity which se->exec_start to detect long sleeping task.
In order to take into account a possible divergence between the clock_task
of 2 rqs, we increase the threshold to around 104 days.
Fixes: 829c1651e9c4 ("sched/fair: sanitize vruntime of entity being placed")
Originally-by: Zhang Qiao <[email protected]>
Signed-off-by: Vincent Guittot <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Tested-by: Zhang Qiao <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
kernel/sched/core.c | 3 ++-
kernel/sched/fair.c | 55 +++++++++++++++++++++++++++++++++++---------
2 files changed, 47 insertions(+), 11 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 488655f..0d18c39 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2084,6 +2084,9 @@ static inline void dequeue_task(struct rq *rq, struct task_struct *p, int flags)
void activate_task(struct rq *rq, struct task_struct *p, int flags)
{
+ if (task_on_rq_migrating(p))
+ flags |= ENQUEUE_MIGRATED;
+
enqueue_task(rq, p, flags);
p->on_rq = TASK_ON_RQ_QUEUED;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7a1b1f8..6986ea3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4648,11 +4648,33 @@ static void check_spread(struct cfs_rq *cfs_rq, struct sched_entity *se)
#endif
}
+static inline bool entity_is_long_sleeper(struct sched_entity *se)
+{
+ struct cfs_rq *cfs_rq;
+ u64 sleep_time;
+
+ if (se->exec_start == 0)
+ return false;
+
+ cfs_rq = cfs_rq_of(se);
+
+ sleep_time = rq_clock_task(rq_of(cfs_rq));
+
+ /* Happen while migrating because of clock task divergence */
+ if (sleep_time <= se->exec_start)
+ return false;
+
+ sleep_time -= se->exec_start;
+ if (sleep_time > ((1ULL << 63) / scale_load_down(NICE_0_LOAD)))
+ return true;
+
+ return false;
+}
+
static void
place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
{
u64 vruntime = cfs_rq->min_vruntime;
- u64 sleep_time;
/*
* The 'current' period is already promised to the current tasks,
@@ -4684,13 +4706,24 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
/*
* Pull vruntime of the entity being placed to the base level of
- * cfs_rq, to prevent boosting it if placed backwards. If the entity
- * slept for a long time, don't even try to compare its vruntime with
- * the base as it may be too far off and the comparison may get
- * inversed due to s64 overflow.
- */
- sleep_time = rq_clock_task(rq_of(cfs_rq)) - se->exec_start;
- if ((s64)sleep_time > 60LL * NSEC_PER_SEC)
+ * cfs_rq, to prevent boosting it if placed backwards.
+ * However, min_vruntime can advance much faster than real time, with
+ * the extreme being when an entity with the minimal weight always runs
+ * on the cfs_rq. If the waking entity slept for a long time, its
+ * vruntime difference from min_vruntime may overflow s64 and their
+ * comparison may get inversed, so ignore the entity's original
+ * vruntime in that case.
+ * The maximal vruntime speedup is given by the ratio of normal to
+ * minimal weight: scale_load_down(NICE_0_LOAD) / MIN_SHARES.
+ * When placing a migrated waking entity, its exec_start has been set
+ * from a different rq. In order to take into account a possible
+ * divergence between new and prev rq's clocks task because of irq and
+ * stolen time, we take an additional margin.
+ * So, cutting off on the sleep time of
+ * 2^63 / scale_load_down(NICE_0_LOAD) ~ 104 days
+ * should be safe.
+ */
+ if (entity_is_long_sleeper(se))
se->vruntime = vruntime;
else
se->vruntime = max_vruntime(se->vruntime, vruntime);
@@ -4770,6 +4803,9 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
if (flags & ENQUEUE_WAKEUP)
place_entity(cfs_rq, se, 0);
+ /* Entity has migrated, no longer consider this task hot */
+ if (flags & ENQUEUE_MIGRATED)
+ se->exec_start = 0;
check_schedstat_required();
update_stats_enqueue_fair(cfs_rq, se, flags);
@@ -7657,9 +7693,6 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
/* Tell new CPU we are migrated */
se->avg.last_update_time = 0;
- /* We have migrated, no longer consider this task hot */
- se->exec_start = 0;
-
update_scan_period(p, new_cpu);
}
On 2023-03-17 at 17:08:10 +0100, Vincent Guittot wrote:
> Commit 829c1651e9c4 ("sched/fair: sanitize vruntime of entity being placed")
> fixes an overflowing bug, but ignore a case that se->exec_start is reset
> after a migration.
>
> For fixing this case, we delay the reset of se->exec_start after
> placing the entity which se->exec_start to detect long sleeping task.
>
> In order to take into account a possible divergence between the clock_task
> of 2 rqs, we increase the threshold to around 104 days.
>
>
> Fixes: 829c1651e9c4 ("sched/fair: sanitize vruntime of entity being placed")
> Signed-off-by: Zhang Qiao <[email protected]>
> Signed-off-by: Vincent Guittot <[email protected]>
>
This patch has been confirmed by 0day to restore the performance which was decreased
by 829c1651e9c4[1], thanks for the fix.
commit: a2e90611b9f425ad 829c1651e9c4a6f78398d3e6765 a53ce18cacb477dd0513c607f18
hackbench
throughput: 173811 -18.4% 141887 +1.4% 176324
[1] https://lore.kernel.org/lkml/[email protected]/
thanks,
Chenyu